HTRC's Workset Builder tool offers advanced search and result filtering functionality to facilitate the creation of HTRC worksets. Learn about the tool and how to use it here.


The HTRC Workset Builder 2.0 Beta for Extracted Features 2.0 is the next iteration of a new interface to the HTRC Extracted Features Dataset. It enables both volume-level metadata search and volume- and page-level unigram (single word) text search of the extracted features in order to build worksets.

...

Searching

You can search for unigrams (single terms) either in volume metadata or in the text of each volume, by page. Since this is a search built on the Extracted Features Dataset, bigram and larger n-gram searches (phrases) are not possible in the conventional sense. Instead, you can search for phrases using quotations (e.g. "snow ski"), which will return volumes and pages where all terms in the query co-occur. In this way, a search for "snow ski" is equivalent to a Solr syntax search of "snow" AND "ski". See more details about Solr syntax, and a link to a guide, in the section below.
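As a rough illustration of that equivalence, the Python sketch below rewrites a quoted query into its AND form. The function name `phrase_to_and_query` is hypothetical and is not part of Workset Builder; it only mimics the rewriting described above.

```python
def phrase_to_and_query(phrase: str) -> str:
    """Rewrite a quoted multi-word query as an AND query over its terms.

    Hypothetical helper for illustration only; not part of Workset Builder.
    """
    # Drop surrounding straight or curly quotes, then split on whitespace.
    terms = phrase.strip().strip('"“”').split()
    # Each term becomes its own quoted unigram, joined by AND.
    return " AND ".join(f'"{term}"' for term in terms)


print(phrase_to_and_query('"snow ski"'))  # prints: "snow" AND "ski"
```

The point of the sketch is that word order and adjacency play no part in the result: only the presence of each term on the page matters.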

Searches are not case sensitive, and by default, your search will be conducted on pages recognized as English. Click “Search all Languages” if you prefer to search everything. You can also limit your search to specific languages by choosing from those that appear under “Show other languages.” Limit your search to a specific part of speech by using the checkboxes under the language, though be aware that not all languages support part-of-speech search. Wildcard matching is possible using '?' for a single character and '*' for multiple characters, for example 'canad?' and '*land'.
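The two wildcards behave much like shell-style patterns. The sketch below uses Python's standard `fnmatch` module, which gives '?' and '*' a comparable meaning, to show which terms the example patterns would match. This is only an analogy: the sample terms are made up, the helper `matching_terms` is hypothetical, and Workset Builder's own matching is done by its search index rather than by this code. Note that in `fnmatch`, '*' also matches zero characters.

```python
from fnmatch import fnmatchcase


def matching_terms(pattern, terms):
    """Return the terms matching a pattern where '?' is one character
    and '*' is any run of characters (hypothetical illustration)."""
    # Searches are not case sensitive, so compare lowercase forms.
    return [t for t in terms if fnmatchcase(t.lower(), pattern.lower())]


# Made-up sample vocabulary, not taken from the corpus.
sample_terms = ["Canada", "canado", "Canadian", "Iceland", "land", "landing"]

print(matching_terms("canad?", sample_terms))  # ['Canada', 'canado']
print(matching_terms("*land", sample_terms))   # ['Iceland', 'land']
```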

There are four options for searching: text, metadata, combined and advanced. Text search will search the full text of the volume, at the page level, for a unigram or unigrams (e.g. searching all volumes for the word "rose"). Results returned are volume-level metadata, along with page-level metadata and bag-of-words tokens. Since this is a page-based search, you will receive one result for each page that matches your query. To see results grouped by volume (multiple page results under one volume heading and one result), check the box marked "Sort & Group by Volume" under the search bar. Metadata search will search volume-level metadata fields for given unigrams, and return volumes in which the terms appear in a given (or any) field specified in the drop-down menu (e.g. searching all volumes for those with a publication place of "bl", the MARC code for Brazil). Combined search allows both text and metadata search in a given query (e.g. a search to return all volumes published in Brazil in which the term "rose" appeared on a page). Advanced search allows users familiar with Solr syntax (see below for more information) to construct and execute their own queries.

When searching the page text, it is important to realize that every word you enter is treated as a separate term (a unigram) for the purposes of the query that is performed. Effectively, phrase searching the page text is not possible. This is because Workset Builder is a search interface built on the Extracted Features Dataset, where the sequential order of the words has been removed, effectively making each page a bag of words. The closest approximation is to use the AND operator: for example, the query lawn AND tennis will return all pages where both words appear somewhere on the page. A hyphenated word is processed as a single term, and so does present as a phrase in terms of indexing: for example, the query "lawn tennis" (in quotes) will find pages where that term appears hyphenated. In the case of volume metadata search, the sequential order of the words is kept. This means phrase searching is possible across metadata.
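To make the bag-of-words point concrete, here is a minimal Python sketch. It assumes a page is represented simply as a list of tokens; the function `page_matches_all` and the two sample pages are invented for illustration and do not reflect the actual Extracted Features file format.

```python
from collections import Counter


def page_matches_all(page_tokens, query_terms):
    """True when every query term occurs somewhere in the page's bag of words."""
    # A bag of words keeps token counts but discards token order.
    bag = Counter(token.lower() for token in page_tokens)
    return all(bag[term.lower()] > 0 for term in query_terms)


# Invented sample pages: only page_b contains the phrase "lawn tennis".
page_a = "the lawn was mown before the tennis match".split()
page_b = "she played lawn tennis every summer".split()

# Both pages satisfy lawn AND tennis, because adjacency is not recorded.
print(page_matches_all(page_a, ["lawn", "tennis"]))  # True
print(page_matches_all(page_b, ["lawn", "tennis"]))  # True

# Reordering a page leaves its bag of words unchanged.
print(Counter(page_b) == Counter(reversed(page_b)))  # True
```

Because the two pages are indistinguishable with respect to the two terms, no query over the bag can single out the page where the words are adjacent.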

Search Text

Text search allows users to query the full text, by page, for unigrams (single terms). A version of phrase searching can be achieved using the same method as described under Search Metadata: using quotation marks to initiate a search for multiple terms on the same page of text. By default, text searches will search English-language volumes. If you'd like to search all languages, check the "Search pages in all languages" button underneath the search bar. Currently, part-of-speech information is only available for volumes in English, German, Portuguese, Danish, Dutch and Swedish. While other languages are coded in volume metadata and thus can be retrieved, there will not be part-of-speech data available for those volumes.

...

Search Metadata

Metadata search allows users to query the catalogue metadata associated with each volume in the corpus (aka volume metadata). Enter a single term, multiple terms, phrases (in quotes), or any combination thereof. By default, "All fields" is selected in the drop-down menu next to the search box. Click on the drop-down menu to select more specific fields to search by, such as Title.

To search multiple metadata fields, enter your search query in a format called Solr syntax. For example, a search for “title_t:hamlet AND contributorName_t:shakespeare” will return all volumes with “hamlet” in the title field and “shakespeare” in the contributorName field, the latter being the field used by the cataloguer to record a personal or corporate name associated with the volume. The same search can be used with an “OR” operator to return volumes that satisfy either condition. Note that there is no space between the colon and the search term. For more information, see this Solr query syntax guide.
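If you assemble such queries programmatically before pasting them into the search box, a small helper can keep the spacing rules straight. The Python sketch below is only an illustration: `build_fielded_query` is a hypothetical function, it does no escaping of special characters, and it assumes single-word terms like those in the example above.

```python
def build_fielded_query(clauses, operator="AND"):
    """Join (field, term) pairs into a query such as 'title_t:hamlet AND ...'.

    Hypothetical helper; handles only simple single-word terms.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    # No space between the colon and the search term.
    return f" {operator} ".join(f"{field}:{term}" for field, term in clauses)


clauses = [("title_t", "hamlet"), ("contributorName_t", "shakespeare")]

print(build_fielded_query(clauses))
# title_t:hamlet AND contributorName_t:shakespeare
print(build_fielded_query(clauses, operator="OR"))
# title_t:hamlet OR contributorName_t:shakespeare
```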

...

| Field name | Field name in Solr syntax | Field description |
| --- | --- | --- |
| Access Profile | accessProfile_t: | The code that indicates full-text access level. |
| Bibliographic Format | bibliographicFormat_t: | The code for the format of a volume (e.g. book, serial, etc.). |
| Classification DDC | classification_ddc_t: | The Dewey Decimal Classification call number supplied by the originating library. |
| Classification LCC | classification_lcc_t: | The Library of Congress Classification call number supplied by the originating library. |
| Date Created | dateCreated_i:, dateCreated_t: | The time this metadata object was processed. |
| Genre | genre_t: | The genre of the volume. |
| Handle URL | handleUrl_t:, htid_s: | The persistent identifier for the given volume. |
| HathiTrust Record Number | hathitrustRecordNumber_t: | The unique record number for the volume in the HathiTrust Digital Library. |
| HathiTrust Bib URL | htBibUrl_t:, mainEntityOfPageCatalogRecord_s: | The HathiTrust Bibliographic RESTful call for the volume metadata record. The provided URL delivers an HTML page. Adding ".json" to the end of the URL delivers the metadata in JSON format. |
| Imprint | imprint_t:, publisherName_t: | The place of publication, publisher, and publication date of the given volume. |
| ISBN | isbn_t: | The International Standard Book Number for a volume. |
| ISSN | issn_t: | The International Standard Serial Number for a volume. |
| Issuance | issuance_t: | The bibliographic level of a volume. |
| Language | language_t: | The primary language of the volume in MARC language code format. |
| Last Update Date | lastUpdateDate_t:, lastRightsUpdateDate_i:, lastRightsUpdateDate_t: | The date this page was last updated. |
| LCCN | lccn_t: | The Library of Congress Control Number for a volume. |
| Names | names_t:, contributorName_t: | The personal and corporate names associated with a volume. |
| OCLC | oclc_t: | The control number(s) assigned to each bibliographic record by the Online Computer Library Center (OCLC). |
| Publication Date | pubDate_i:, pubDate_t: | The publication year. |
| Publication Place | pubPlace_t:, pubPlaceName_t: | The publication location code in MARC country code format. |
| Rights Attribute (Access Rights) | rightsAttributes_t:, accessRights_t: | The rights attributes for a volume. |
| Schema Version | schemaVersion_t:, schemaVersion_s: | A version identifier for the format and structure of this metadata object. |
| Source Institution Record Number | sourceInstitutionRecordNumber_t: | The unique record number for the volume from its original institution. |
| Source Institution | sourceInstitution_t: | The institution code of the original institution that contributed the volume. |
| Title | title_t: | Title of the volume. |
| Type of Resource | typeOfResource_t:, typeOfResource_s: | The format type of a volume. |
| Volume Identifier | volumeIdentifier_t: | A unique identifier for the current volume. This is the same identifier used in the HathiTrust and HathiTrust Research Center corpora. |

...

On your results page, you will see seven different fields that can be used to filter search results. These fields are derived from the same metadata fields listed above: genre, language, copyright status, author, place of publication, original bibliographic format and classification. To apply facets to filter results, check the boxes next to the desired facets under a given heading (e.g. "author"), and then click the "Apply Filter" button that will appear next to the section heading. To filter by values in more than one field/heading, you must first apply filters in one field before doing so in another.

...