...
Rachel Brekhus, University of Missouri
Matt Wilkens, University of Notre Dame
...
Discussion summary notes:
From Matt Wilkens: “As a user, one of the first things I think I will be looking to do would be to pull all the feature counts into a database. Has there been any thought on the part of HTRC about putting the feature counts directly into a database instead of into a file?”
Discussion: Ultimately, HTRC is thinking of providing some type of API (or APIs). So, instead of downloading files, users would access the data through the API.
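A minimal sketch of what such a database load might look like, assuming each volume’s feature counts arrive as a per-volume JSON file with per-page token counts. The field names below (“id”, “features”, “pages”, “seq”, “tokenCounts”) and the filename are illustrative assumptions, not the actual file format:

    import json
    import sqlite3

    def load_volume(db, path):
        # Parse one per-volume feature file (assumed JSON layout).
        with open(path, encoding="utf-8") as f:
            vol = json.load(f)
        vol_id = vol["id"]  # assumed top-level volume-ID field
        rows = []
        for page in vol["features"]["pages"]:  # assumed per-page list
            for token, count in page.get("tokenCounts", {}).items():
                rows.append((vol_id, page["seq"], token, count))
        db.executemany(
            "INSERT INTO counts (volume_id, page, token, count) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )

    db = sqlite3.connect("features.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS counts "
        "(volume_id TEXT, page TEXT, token TEXT, count INTEGER)"
    )
    load_volume(db, "example_volume.json")  # hypothetical filename
    db.commit()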
Peter asked: “How much interest may there be in engineering a more complex API? What other things can we (prioritize to) do with the data? Would the likely eventual size of the data (multiple gigabytes per decade), once we scale up to the contents of the entire HTRC corpus, exceed what most people can download to, or work with on, their desktop machines? Of course, people may want to do things at large scale. Would it be reasonable or unreasonable to push that effort onto the user?”
Discussion: You can actually rsync the individual volumes as well as the grouped tar files. However, the volume IDs have to be filename-friendly, which HT volume IDs are not: they contain reserved characters that should not be used as part of names in the file system. We could provide a mapping from the HT volume IDs to filename-friendly IDs. A volume ID is not equivalent to a filename.
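One possible shape for such a mapping, modeled on pairtree-style identifier cleaning. Whether HTRC would adopt exactly these character substitutions is an assumption of this sketch:

    # Reserved characters in HT volume IDs mapped to filename-safe ones,
    # and back. The specific substitutions (":" -> "+", "/" -> "=",
    # "." -> ",") follow pairtree-style cleaning and are assumed here.
    CLEAN = str.maketrans({":": "+", "/": "=", ".": ","})
    DIRTY = str.maketrans({"+": ":", "=": "/", ",": "."})

    def id_to_filename(volume_id: str) -> str:
        """Map an HT volume ID to a filename-friendly string."""
        return volume_id.translate(CLEAN)

    def filename_to_id(name: str) -> str:
        """Recover the original HT volume ID from a cleaned filename."""
        return name.translate(DIRTY)

    # e.g. id_to_filename("uc2.ark:/13960/t4mk66f1d")
    #      -> "uc2,ark+=13960=t4mk66f1d"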
...
“Can we do prepared filename lists for predictable kinds of worksets? Can we organize or split up datasets by genre? An argument for doing so is that, for example, fiction may be in higher demand for download than other genres.”
Discussion: We debated how to group the datasets and ultimately decided to group them chronologically, by year or decade, mainly because this would be less controversial. Questions like “What counts as fiction?” could be highly controversial; grouping by chronology lets us bypass that controversy. However, this is not wholly unproblematic either: a book listed as having been published in 1916 could well be a reprint of something published in 1872, etc.
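A minimal sketch of the chronological bucketing, assuming each volume record carries a listed publication year (the “pub_year” field and the record layout are hypothetical stand-ins for the real metadata):

    from collections import defaultdict

    def group_by_decade(volumes):
        # Bucket volume IDs by the decade of their listed publication year.
        buckets = defaultdict(list)
        for vol in volumes:
            decade = (vol["pub_year"] // 10) * 10
            buckets[decade].append(vol["id"])
        return dict(buckets)

    volumes = [
        {"id": "vol-a", "pub_year": 1872},
        {"id": "vol-b", "pub_year": 1916},  # could itself be a reprint of an 1872 text
        {"id": "vol-c", "pub_year": 1919},
    ]
    print(group_by_decade(volumes))
    # {1870: ['vol-a'], 1910: ['vol-b', 'vol-c']}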
“This is going to be a very new kind of thing for many users. How do we describe to users how to use this tool? What kind of documentation is planned?” Rachel asks for examples and documentation.
...
Using the extracted features will require some effort on the part of the user. The number of users who have been waiting to try this is likely to be small. The bottleneck is not so much HTRC as the small number of such users.
Discussion: When people see what they can do with the features, and see that it is tractable, there will be more acceptance of this.
From Rachel: There’s a tool (“Paper Machine”) that came out recently from the American Sociological Association [?]. They are doing some extraction from government documents about policy statements, and trying to do data visualization, some kind of word cloud, out of it.
Discussion: Are government documents in the HT corpus always clearly marked as such? Not sure.
...