HTRC's Workset Builder tool offers advanced search and result filtering functionality to facilitate the creation of HTRC worksets. Learn about the tool and how to use it here.

The HTRC Workset Builder 2.0 Beta for Extracted Features 2.0 is the next iteration of the interface to the HTRC Extracted Features Dataset. It enables both volume-level metadata search and volume- and page-level unigram (single word) text search of the extracted features in order to build worksets.

As indicated, this interface is currently in beta, and may change.

Quick Guide: How to Build a Workset

To build a workset using HTRC Workset Builder, follow these general steps:

Step 0: Decide what to search
Step 1: Perform a unigram (single-term) full text or metadata (using field tags) search at the desired level (page or volume)
Step 2: Filter through the results and select desired items to add to your workset
Step 3: Repeat steps 1 and 2 as necessary until your workset is ready
Step 4: "Export as workset" to HTRC Analytics or download workset metadata.

More information about each of the steps in the workset building process is included in the sections below.

Searching

Searches are not case sensitive, and by default your search will be conducted on pages recognized as English. Click “Search all Languages” if you prefer to search everything. You can also limit your search to specific languages by choosing from those that appear under “Show other languages.” Limit your search to a specific part of speech by using the checkboxes under the language, though be aware that not all languages support searching by part of speech. Wildcard matching is possible using '?' for a single character and '*' for multiple characters, for example 'canad?' and '*land'.
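
The wildcard behaviour described above can be illustrated with a minimal Python sketch. It is not how Workset Builder or Solr implements matching; the function names wildcard_to_regex and matches are hypothetical, and the sketch assumes that '*' may also match zero characters.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a '?' / '*' wildcard pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # assumption: zero or more characters
        else:
            parts.append(re.escape(ch))
    # Searches are not case sensitive, so ignore case here as well.
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern: str, term: str) -> bool:
    """Return True when the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

print(matches("canad?", "Canada"))    # True
print(matches("canad?", "canadian"))  # False: '?' stands for one character only
print(matches("*land", "Iceland"))    # True
```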

There are four options for searching: text, metadata, combined and advanced. Text search will search the full text of the volume, at the page level, for a unigram or unigrams (e.g. searching all volumes for the word "rose"). Results returned are volume-level metadata, along with page-level metadata and bag-of-words tokens. Since this is a page-based search, you will receive one result for each page that matches your query. To see results grouped by volume (multiple page results under one volume heading and one result), check the box marked "Sort & Group by Volume" under the search bar. Metadata search will search volume-level metadata fields for given unigrams, and return volumes in which the terms appear in a given (or any) field specified in the drop-down menu (e.g. searching all volumes for those with a publication place of "bl", the MARC code for Brazil). Combined search allows both text and metadata search in a given query (e.g. a search to return all volumes published in Brazil in which the term "rose" appeared on a page). Advanced search allows users familiar with Solr syntax (see below for more information) to construct and execute their own queries.

When searching the page text, it is important to realize that every word you enter is treated as a separate term (a unigram) for the purposes of the query that is performed. In effect, phrase searching of the page text is not possible. This is because Workset Builder is a search interface built on the Extracted Features Dataset, where the sequential order of the words has been removed, effectively making each page a bag of words. The closest approximation is to use the AND operator: for example, the query lawn AND tennis will return all pages where both words appear somewhere on the page. A hyphenated word is processed as a single term, and so does behave like a phrase in terms of indexing: for example, the query "lawn tennis" (in quotes) will find pages where that term appears hyphenated. In the case of volume metadata search, the sequential order of the words is kept. This means phrase searching is possible across metadata.
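
To see why a bag of words rules out phrase searching, consider the following rough Python sketch. It only imitates the idea: the tokenisation (lowercasing and splitting on whitespace) and the names bag_of_words and page_matches_all are assumptions made for illustration, not the Extracted Features pipeline.

```python
from collections import Counter

def bag_of_words(page_text: str) -> Counter:
    """Reduce a page to lowercase token counts, discarding word order."""
    return Counter(page_text.lower().split())

def page_matches_all(bag: Counter, terms: list[str]) -> bool:
    """Imitate `term1 AND term2`: every term must occur somewhere on the page."""
    return all(bag[term.lower()] > 0 for term in terms)

page_a = bag_of_words("She played lawn tennis every summer")
page_b = bag_of_words("Tennis balls rolled across the lawn")

# Both pages satisfy `lawn AND tennis`, although only page_a contains the
# two words next to each other. Once order is gone, the pages look alike.
print(page_matches_all(page_a, ["lawn", "tennis"]))  # True
print(page_matches_all(page_b, ["lawn", "tennis"]))  # True
```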

Search Text

Text search allows users to query the full text, by page, for unigrams (single terms). By default, text searches will search English-language volumes. If you'd like to search all languages, check the "Search pages in all languages" button underneath the search bar. Currently, part-of-speech information is only available for volumes in English, German, Portuguese, Danish, Dutch and Swedish. While other languages are coded in volume metadata and thus can be retrieved, there will not be part-of-speech data available for those volumes.

Text searches will retrieve volume-level metadata, but the main unit of search and retrieval is the page. Since many pages in a single volume may contain a given unigram, users may wish to check the "Sort & Group by Volume" button directly beneath the search bar, which will present results by the volume, with a list of pages on which the term appears, as compared to multiple volume entries with a single associated page in the results view.

Search Metadata

Metadata search allows users to query the catalogue metadata associated with each volume in the corpus (aka volume metadata). Enter a single term, multiple terms, phrases (in quotes), or any combination thereof. By default, "All fields" is selected in the drop-down menu next to the search box. Click on the drop-down menu to select more specific fields to search by, such as Title.

To search multiple metadata fields, enter your search query in a format called Solr syntax. For example, a search for title_t:hamlet AND contributorName_t:shakespeare will return all volumes with “hamlet” in the title field and “shakespeare” in the contributorName field, the latter being the field used by the cataloguer to record a personal or corporate name associated with the volume. The same search can be used with an “OR” operator to return volumes that satisfy either condition. Note that there is no space between the colon and the search term. For more information, see this Solr query syntax guide.
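
If you assemble such queries programmatically before pasting them into the search box, a small helper along the following lines may be useful. This is only a sketch: field_clause and combine are hypothetical names, and the helper simply formats strings in the field:term form shown above.

```python
def field_clause(field: str, term: str) -> str:
    """Format one clause with no space between the colon and the term."""
    return f"{field}:{term}"

def combine(clauses: list[str], operator: str = "AND") -> str:
    """Join clauses with an upper-case Boolean operator (AND or OR)."""
    op = operator.upper()
    if op not in {"AND", "OR"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f" {op} ".join(clauses)

clauses = [
    field_clause("title_t", "hamlet"),
    field_clause("contributorName_t", "shakespeare"),
]
print(combine(clauses))        # title_t:hamlet AND contributorName_t:shakespeare
print(combine(clauses, "or"))  # title_t:hamlet OR contributorName_t:shakespeare
```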

For information on the volume metadata fields, including possible values for fields with controlled vocabularies, see the Metadata field values section below.

Searching dates

Metadata date fields can be searched using (primarily) 4-digit years, e.g. "1948". A variety of dates can be associated with a volume, such as its publication date (pubDate_i) and the date the digital form of the record was created (dateCreated_i). Being numeric, these values are indexed in "_i" fields, where the suffix indicates the indexed value is an integer (as opposed to "_t" for text).

To go beyond searching for a document from a particular year (e.g. pubDate_i:1948) to search over a date range, you need to use more complex Solr syntax (more information in the next section) to specify the numeric start and stop values. After the field to search by is specified, this takes the form [ <val> TO <val>]. For example, to search for volumes published 1880-1883 enter “pubDate_i:[1880 TO 1883]” in the Advanced search tab. Solr query syntax is case-sensitive, and so, the value range must be expressed as "TO" (and not, for example, "to").
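
As a small illustration, the range form can be generated with a helper like the one below. The function name year_range_query is hypothetical; the output is the same string you would type into the Advanced search tab.

```python
def year_range_query(field: str, start: int, end: int) -> str:
    """Build an inclusive range query; "TO" must be upper case."""
    if start > end:
        raise ValueError("start year must not be later than end year")
    return f"{field}:[{start} TO {end}]"

print(year_range_query("pubDate_i", 1880, 1883))  # pubDate_i:[1880 TO 1883]
```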

Most recorded dates are numeric. However, sometimes the exact date is not known when a volume is catalogued, and an "X", "-" or "?" is substituted for a digit to signify this, such as "194X", "194-" or "194?". For this reason, "_t" versions of the date-related fields are also provided to allow more thorough date searches to be expressed, albeit with more complex syntax. Note that all dates are mapped to the "_t" version of the field, including the fully numeric ones.
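
When working with exported metadata, it can help to separate exact years from partially known ones. The following is a minimal sketch under the assumption that date values are four characters long, with trailing digits replaced by "X", "-" or "?"; classify_date is a hypothetical helper and real records may contain other forms.

```python
import re

EXACT_YEAR = re.compile(r"\d{4}")
# One to three leading digits followed by placeholder characters.
UNCERTAIN_YEAR = re.compile(r"\d{1,3}[X?-]{1,3}")

def classify_date(value: str) -> str:
    """Label a date string as 'exact', 'uncertain' or 'other'."""
    if EXACT_YEAR.fullmatch(value):
        return "exact"
    if len(value) == 4 and UNCERTAIN_YEAR.fullmatch(value):
        return "uncertain"
    return "other"

for value in ["1948", "194X", "194-", "194?", "n.d."]:
    print(value, classify_date(value))
```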

To continue with the previous example of searching for volumes published 1880-1883, a more expansive form of the search would be: pubDate_t:188X OR pubDate_t:188\? OR pubDate_t:1880 OR pubDate_t:1881 OR pubDate_t:1882 OR pubDate_t:1883. One could then look through the results and decide on a case-by-case basis whether to leave a 188?, 188- or 188X date in the result set, or press the 'x' icon in the corner to have it removed. If the search criterion is publications in a given decade, including the '?', '-' and 'X' ones, then the needed query syntax gets simpler: pubDate_t:188?. Note that this time the "?" is not preceded by a backslash (\), meaning it is interpreted by Solr as the wildcard character for matching purposes, not the literal question mark symbol. More about this in the next section!
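
Typing out the expansive form by hand gets tedious for longer ranges. The sketch below generates it, mirroring the example query in the text (the "X" and escaped "?" decade placeholders followed by each exact year). The function name expanded_year_query is hypothetical and 4-digit years are assumed.

```python
def expanded_year_query(field: str, start: int, end: int) -> str:
    """Build an OR query covering exact years plus uncertain decade forms."""
    years = range(start, end + 1)
    # Decade prefixes touched by the range, e.g. "188" for 1880-1883.
    decades = sorted({str(year)[:3] for year in years})
    clauses = []
    for decade in decades:
        clauses.append(f"{field}:{decade}X")
        # The backslash makes Solr treat "?" as a literal question mark.
        clauses.append(f"{field}:{decade}\\?")
    clauses.extend(f"{field}:{year}" for year in years)
    return " OR ".join(clauses)

print(expanded_year_query("pubDate_t", 1880, 1883))
```

Results matched only through a decade placeholder still need to be reviewed individually, as described above.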

Volume Metadata

While Solr query syntax terms (met in the earlier sections, above) such as "AND", "OR", and "TO" are case-sensitive, meaning they must be entered exactly as detailed, searching for volume-level metadata is case-insensitive: searching for Shakespeare with the term "SHAKESPEARE", "shakespeare" or even "ShakespEare" returns the same results. Solr query syntax supports wildcard matching using '?' for a single character and '*' for multiple characters, for example 'canad?' and '*land'. While the HTRC Workset Builder interface is designed for unigram (single-term) searching of page text, volume-level metadata also supports phrase-based search. This is done using quotation marks: "term1 term2".

Via the "Metadata" tab you can enter terms directly into the search box, in which case the terms are matched against a common set of volume metadata fields. The main ones are: Title, Name, Genre, Type of Resource, Place of Publication. You can click on the "Show full query ..." button on the search results page to see what the complete list of fields is.

You can also search for specific volume metadata fields using the field:term form, such as contributorName_t:Austen, or as a phrase-based search, contributorName_t:"Austen, Jane".
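
A possible way to decide between the plain and the quoted form when building queries in code is sketched below. The helper field_query is hypothetical; it quotes any value containing whitespace and does not attempt to handle values that themselves contain quotation marks.

```python
def field_query(field: str, value: str) -> str:
    """Use the phrase form (quoted) when the value has more than one word."""
    if any(ch.isspace() for ch in value):
        return f'{field}:"{value}"'
    return f"{field}:{value}"

print(field_query("contributorName_t", "Austen"))        # contributorName_t:Austen
print(field_query("contributorName_t", "Austen, Jane"))  # contributorName_t:"Austen, Jane"
```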

In alphabetical order, searchable fields are as follows. In moving from Extracted Features 1.5 to 2.0, some of the indexed metadata fields have been changed. In the table below, where this has occurred, the old (v1.5) name is displayed with ~~strikethrough~~ to explicitly mark the transition.

Field name	Field name in Solr syntax	Field description
Access Profile	accessProfile_t:	The code that indicates full-text access level.
Bibliographic Format	bibliographicFormat_t:	The code for the format of a volume (e.g. book, serial, etc.).
Date Created	dateCreated_i, dateCreated_t:	The time this metadata object was processed.
Genre	genre_t:	The genre of the volume.
Handle URL	~~handleUrl_t:~~ htid_s:	The persistent identifier for the given volume.
HathiTrust Record Number	hathitrustRecordNumber_t:	The unique record number for the volume in the HathiTrust Digital Library.
HathiTrust Bib URL	~~htBibUrl_t:~~ mainEntityOfPageCatalogRecord_s:	The HathiTrust Bibliographic RESTful call for the volume metadata record. The provided URL delivers an HTML page. Adding ".json" to the end of the URL delivers the metadata in JSON format.
Imprint	~~imprint_t:~~ publisherName_t:	The place of publication, publisher, and publication date of the given volume.
Language	language_t:	The primary language of the volume in MARC language code format.
Last Update Date	~~lastUpdateDate_t:~~ lastRightsUpdateDate_i:, lastRightsUpdateDate_t:	The date this page was last updated.
Names	~~names_t:~~ contributorName_t:	The personal and corporate names associated with a volume.
OCLC	oclc_t:	The control number(s) assigned to each bibliographic record by the Online Computer Library Center (OCLC).
Publication Date	pubDate_i:, pubDate_t:	The publication year.
Publication Place	~~pubPlace_t:~~ pubPlaceName_t:	The publication location code in MARC country code format.
Rights Attribute (Access Rights)	~~rightsAttributes_t:~~ accessRights_t:	The rights attributes for a volume.
Schema Version	~~schemaVersion_t:~~ schemaVersion_s:	A version identifier for the format and structure of this metadata object.
Source Institution	sourceInstitution_t:	The institution code of the original institution that contributed the volume.
Title	title_t:	Title of the volume.
Type of Resource	~~typeOfResource_t:~~ typeOfResource_s:	The format type of a volume.
Volume Identifier	volumeIdentifier_t:	A unique identifier for the current volume. This is the same identifier used in the HathiTrust and HathiTrust Research Center corpora.

Results

On the results page, you will find the title and unique HathiTrust ID for each volume that contains a result based on your search. You can hover over the title of each volume to trigger a pop-up with brief metadata information. If a text search is part of your query, the page sequence on which your search term appears is also listed. Lastly, a link to download the Extracted Features data for each volume in your results is generated. If you follow the link to the page sequence for your search term, the Extracted Features data for that page (tokens in a variety of views, parts of speech, and token frequencies) is shown, together with links to download the Extracted Features files for the page or volume and a thumbnail of the page image with a link back to HathiTrust to view the page directly. Additionally, the full metadata record in human-readable form is available below the volume title if you click the "Show metadata" link to expand the section.

Filtering

On your results page, you will see seven different fields that can be used to filter search results. These fields are derived from the same metadata fields listed above: genre, language, copyright status, author, place of publication, original bibliographic format and classification. To apply facets to filter results, check boxes next to the desired facets under a given heading (e.g. "author"), and then click the "Apply Filter" button that will appear next to the section heading. To filter by values in more than one field/heading, you must first choose to apply filters in one field before doing so in another.

Exporting results

Once you have a desired set of results on the search page, you may work with or save them in a number of ways. For result sets of fewer than 40 million pages, you may export the entire set of search results as a list of volume or page IDs or as a metadata manifest with one row per volume, or you may choose to download the Extracted Features files for each volume in your result set. When downloading Extracted Features files for result sets, be mindful that for many volumes, this will be a large download, which can take minutes to complete.

To create a workset from more than one search, you can add volumes you'd like to include to your shopping cart by checking boxes next to each volume and pressing the yellow "Add" button, or by selecting volumes via check box and dragging and dropping them onto the shopping cart icon at the top right of the results page. If you'd like to change the checkboxes for every item at once, you can use the "Select All On This Page" or "Deselect All" buttons to check or uncheck all results. Similarly, the "Invert Selection" button can be used to change all checked items to unchecked, and vice versa.

Once your shopping cart is complete with the volumes you're interested in, click on the cart icon to view your workset. From this page, you can directly import your shopping cart as a workset in HTRC Analytics by clicking the "Export as Workset" button at the top right of the shopping cart page. From here, you'll be taken to HTRC Analytics, prompted to sign in, if you aren't already, and asked to provide a name and description of your new workset. Once a workset is imported into Analytics, you can get metadata information, share the workset, and run algorithms over its contents.

Saving worksets

Since worksets created using the Workset Builder are tied to a web browser session, once you exit your browser, your workset will not be saved unless you export it in one of the above ways. For the same reason, worksets cannot be shared via URL unless they are imported into HTRC Analytics.

Metadata field values

The Bibliographic Format field is coded, for example using BK to represent a book. The possible codes used for this field are:

BK:	Books	CF:	Computer Files	CR:	Continuing Resources	MP:	Maps
MU:	Music	MX:	Mixed Materials	SE:	Serials	VM:	Visual Materials
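
When post-processing exported metadata, coded values like these can be expanded with a simple lookup. The sketch below copies the eight codes from the table above; the names BIBLIOGRAPHIC_FORMATS and describe_format are hypothetical, and the same pattern could be applied to the country, language and rights code lists that follow.

```python
# Codes and labels copied from the table above.
BIBLIOGRAPHIC_FORMATS = {
    "BK": "Books",
    "CF": "Computer Files",
    "CR": "Continuing Resources",
    "MP": "Maps",
    "MU": "Music",
    "MX": "Mixed Materials",
    "SE": "Serials",
    "VM": "Visual Materials",
}

def describe_format(code: str) -> str:
    """Return the label for a format code, tolerating case and whitespace."""
    return BIBLIOGRAPHIC_FORMATS.get(code.strip().upper(), "Unknown format code")

print(describe_format("BK"))   # Books
print(describe_format("se "))  # Serials
print(describe_format("ZZ"))   # Unknown format code
```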

The Place of Publication field pubPlaceName_t uses MARC country codes, with the following possible values:

aa:	Albania	abc:	Alberta	aca:	Australian Capital Territory	ae:	Algeria
af:	Afghanistan	ag:	Argentina	ai:	Armenia (Republic)	aj:	Azerbaijan
aku:	Alaska	alu:	Alabama	am:	Anguilla	an:	Andorra
ao:	Angola	aq:	Antigua and Barbuda	aru:	Arkansas	as:	American Samoa
at:	Australia	au:	Austria	aw:	Aruba	ay:	Antarctica
azu:	Arizona	ba:	Bahrain	bb:	Barbados	bcc:	British Columbia
bd:	Burundi	be:	Belgium	bf:	Bahamas	bg:	Bangladesh
bh:	Belize	bi:	British Indian Ocean Territory	bl:	Brazil	bm:	Bermuda Islands
bn:	Bosnia and Herzegovina	bo:	Bolivia	bp:	Solomon Islands	br:	Burma
bs:	Botswana	bt:	Bhutan	bu:	Bulgaria	bv:	Bouvet Island
bw:	Belarus	bx:	Brunei	ca:	Caribbean Netherlands	cau:	California
cb:	Cambodia	cc:	China	cd:	Chad	ce:	Sri Lanka
cf:	Congo (Brazzaville)	cg:	Congo (Democratic Republic)	ch:	China (Republic : 1949- )	ci:	Croatia
cj:	Cayman Islands	ck:	Colombia	cl:	Chile	cm:	Cameroon
co:	Curaçao	cou:	Colorado	cq:	Comoros	cr:	Costa Rica
ctu:	Connecticut	cu:	Cuba	cv:	Cabo Verde	cw:	Cook Islands
cx:	Central African Republic	cy:	Cyprus	dcu:	District of Columbia	deu:	Delaware
dk:	Denmark	dm:	Benin	dq:	Dominica	dr:	Dominican Republic
ea:	Eritrea	ec:	Ecuador	eg:	Equatorial Guinea	em:	Timor-Leste
enk:	England	er:	Estonia	es:	El Salvador	et:	Ethiopia
fa:	Faroe Islands	fg:	French Guiana	fi:	Finland	fj:	Fiji
fk:	Falkland Islands	flu:	Florida	fm:	Micronesia (Federated States)	fp:	French Polynesia
fr:	France	fs:	Terres australes et antarctiques françaises	ft:	Djibouti	gau:	Georgia
gb:	Kiribati	gd:	Grenada	gh:	Ghana	gi:	Gibraltar
gl:	Greenland	gm:	Gambia	go:	Gabon	gp:	Guadeloupe
gr:	Greece	gs:	Georgia (Republic)	gt:	Guatemala	gu:	Guam
gv:	Guinea	gw:	Germany	gy:	Guyana	gz:	Gaza Strip
hiu:	Hawaii	hm:	Heard and McDonald Islands	ho:	Honduras	ht:	Haiti
hu:	Hungary	iau:	Iowa	ic:	Iceland	idu:	Idaho
ie:	Ireland	ii:	India	ilu:	Illinois	inu:	Indiana
io:	Indonesia	iq:	Iraq	ir:	Iran	is:	Israel
it:	Italy	iv:	Côte d'Ivoire	iy:	Iraq-Saudi Arabia Neutral Zone	ja:	Japan
ji:	Johnston Atoll	jm:	Jamaica	jo:	Jordan	ke:	Kenya
kg:	Kyrgyzstan	kn:	Korea (North)	ko:	Korea (South)	ksu:	Kansas
ku:	Kuwait	kv:	Kosovo	kyu:	Kentucky	kz:	Kazakhstan
lau:	Louisiana	lb:	Liberia	le:	Lebanon	lh:	Liechtenstein
li:	Lithuania	lo:	Lesotho	ls:	Laos	lu:	Luxembourg
lv:	Latvia	ly:	Libya	mau:	Massachusetts	mbc:	Manitoba
mc:	Monaco	mdu:	Maryland	meu:	Maine	mf:	Mauritius
mg:	Madagascar	miu:	Michigan	mj:	Montserrat	mk:	Oman
ml:	Mali	mm:	Malta	mnu:	Minnesota	mo:	Montenegro
mou:	Missouri	mp:	Mongolia	mq:	Martinique	mr:	Morocco
msu:	Mississippi	mtu:	Montana	mu:	Mauritania	mv:	Moldova
mw:	Malawi	mx:	Mexico	my:	Malaysia	mz:	Mozambique
nbu:	Nebraska	ncu:	North Carolina	ndu:	North Dakota	ne:	Netherlands
nfc:	Newfoundland and Labrador	ng:	Niger	nhu:	New Hampshire	nik:	Northern Ireland
nju:	New Jersey	nkc:	New Brunswick	nl:	New Caledonia	nmu:	New Mexico
nn:	Vanuatu	no:	Norway	np:	Nepal	nq:	Nicaragua
nr:	Nigeria	nsc:	Nova Scotia	ntc:	Northwest Territories	nu:	Nauru
nuc:	Nunavut	nvu:	Nevada	nw:	Northern Mariana Islands	nx:	Norfolk Island
nyu:	New York (State)	nz:	New Zealand	ohu:	Ohio	oku:	Oklahoma
onc:	Ontario	oru:	Oregon	ot:	Mayotte	pau:	Pennsylvania
pc:	Pitcairn Island	pe:	Peru	pf:	Paracel Islands	pg:	Guinea-Bissau
ph:	Philippines	pic:	Prince Edward Island	pk:	Pakistan	pl:	Poland
pn:	Panama	po:	Portugal	pp:	Papua New Guinea	pr:	Puerto Rico
pw:	Palau	py:	Paraguay	qa:	Qatar	qea:	Queensland
quc:	Québec (Province)	rb:	Serbia	re:	Réunion	rh:	Zimbabwe
riu:	Rhode Island	rm:	Romania	ru:	Russia (Federation)	rw:	Rwanda
sa:	South Africa	sc:	Saint-Barthélemy	scu:	South Carolina	sd:	South Sudan
sdu:	South Dakota	se:	Seychelles	sf:	Sao Tome and Principe	sg:	Senegal
sh:	Spanish North Africa	si:	Singapore	sj:	Sudan	sl:	Sierra Leone
sm:	San Marino	sn:	Sint Maarten	snc:	Saskatchewan	so:	Somalia
sp:	Spain	sq:	Swaziland	sr:	Surinam	ss:	Western Sahara
st:	Saint-Martin	stk:	Scotland	su:	Saudi Arabia	sw:	Sweden
sx:	Namibia	sy:	Syria	sz:	Switzerland	ta:	Tajikistan
tc:	Turks and Caicos Islands	tg:	Togo	th:	Thailand	ti:	Tunisia
tk:	Turkmenistan	tl:	Tokelau	tma:	Tasmania	tnu:	Tennessee
to:	Tonga	tr:	Trinidad and Tobago	ts:	United Arab Emirates	tu:	Turkey
tv:	Tuvalu	txu:	Texas	tz:	Tanzania	ua:	Egypt
uc:	United States Misc. Caribbean Islands	ug:	Uganda	uik:	United Kingdom Misc. Islands	un:	Ukraine
up:	United States Misc. Pacific Islands	utu:	Utah	uv:	Burkina Faso	uy:	Uruguay
uz:	Uzbekistan	vau:	Virginia	vb:	British Virgin Islands	vc:	Vatican City
ve:	Venezuela	vi:	Virgin Islands of the United States	vm:	Vietnam	vp:	Various places
vra:	Victoria	vtu:	Vermont	wau:	Washington (State)	wea:	Western Australia
wf:	Wallis and Futuna	wiu:	Wisconsin	wj:	West Bank of the Jordan River	wk:	Wake Island
wlk:	Wales	ws:	Samoa	wvu:	West Virginia	wyu:	Wyoming
xa:	Christmas Island (Indian Ocean)	xb:	Cocos (Keeling) Islands	xc:	Maldives	xd:	Saint Kitts-Nevis
xe:	Marshall Islands	xf:	Midway Islands	xga:	Coral Sea Islands Territory	xh:	Niue
xj:	Saint Helena	xk:	Saint Lucia	xl:	Saint Pierre and Miquelon	xm:	Saint Vincent and the Grenadines
xn:	Macedonia	xna:	New South Wales	xo:	Slovakia	xoa:	Northern Territory
xp:	Spratly Island	xr:	Czech Republic	xra:	South Australia	xs:	South Georgia and the South Sandwich Islands
xv:	Slovenia	xx:	No place	xxc:	Canada	xxk:	United Kingdom
xxu:	United States	ye:	Yemen	ykc:	Yukon Territory	za:	Zambia

Deprecated MARC Place of Publication codes, which are no longer being actively used, but may still appear in data, are:

-ac:	Ashmore and Cartier Islands	-ai:	Anguilla	-air:	Armenian S.S.R.	-ajr:	Azerbaijan S.S.R.
-bwr:	Byelorussian S.S.R.	-cn:	Canada	-cp:	Canton and Enderbury Islands	-cs:	Czechoslovakia
-cz:	Canal Zone	-err:	Estonia	-ge:	Germany (East)	-gn:	Gilbert and Ellice Islands
-gsr:	Georgian S.S.R.	-hk:	Hong Kong	-iu:	Israel-Syria Demilitarized Zones	-iw:	Israel-Jordan Demilitarized Zones
-jn:	Jan Mayen	-kgr:	Kirghiz S.S.R.	-kzr:	Kazakh S.S.R.	-lir:	Lithuania
-ln:	Central and Southern Line Islands	-lvr:	Latvia	-mh:	Macao	-mvr:	Moldavian S.S.R.
-na:	Netherlands Antilles	-nm:	Northern Mariana Islands	-pt:	Portuguese Timor	-rur:	Russian S.F.S.R.
-ry:	Ryukyu Islands, Southern	-sb:	Svalbard	-sk:	Sikkim	-sv:	Swan Islands
-tar:	Tajik S.S.R.	-tkr:	Turkmen S.S.R.	-tt:	Trust Territory of the Pacific Islands	-ui:	United Kingdom Misc. Islands
-uk:	United Kingdom	-unr:	Ukraine	-ur:	Soviet Union	-us:	United States
-uzr:	Uzbek S.S.R.	-vn:	Vietnam, North	-vs:	Vietnam, South	-wb:	West Berlin
-xi:	Saint Kitts-Nevis-Anguilla	-xxr:	Soviet Union	-ys:	Yemen (People's Democratic Republic)	-yu:	Serbia and Montenegro

The set of codes used for the Language field language_t is also derived from MARC language codes, with the following possible values:

aar:	Afar	abk:	Abkhaz	ace:	Achinese	ach:	Acoli
ada:	Adangme	ady:	Adygei	afa:	Afroasiatic (Other)	afh:	Afrihili (Artificial language)
afr:	Afrikaans	ain:	Ainu	aka:	Akan	akk:	Akkadian
alb:	Albanian	ale:	Aleut	alg:	Algonquian (Other)	alt:	Altai
amh:	Amharic	ang:	English, Old (ca. 450-1100)	anp:	Angika	apa:	Apache languages
ara:	Arabic	arc:	Aramaic	arg:	Aragonese	arm:	Armenian
arn:	Mapuche	arp:	Arapaho	art:	Artificial (Other)	arw:	Arawak
asm:	Assamese	ast:	Bable	ath:	Athapascan (Other)	aus:	Australian languages
ava:	Avaric	ave:	Avestan	awa:	Awadhi	aym:	Aymara
aze:	Azerbaijani	bad:	Banda languages	bai:	Bamileke languages	bak:	Bashkir
bal:	Baluchi	bam:	Bambara	ban:	Balinese	baq:	Basque
bas:	Basa	bat:	Baltic (Other)	bej:	Beja	bel:	Belarusian
bem:	Bemba	ben:	Bengali	ber:	Berber (Other)	bho:	Bhojpuri
bih:	Bihari (Other)	bik:	Bikol	bin:	Edo	bis:	Bislama
bla:	Siksika	bnt:	Bantu (Other)	bos:	Bosnian	bra:	Braj
bre:	Breton	btk:	Batak	bua:	Buriat	bug:	Bugis
bul:	Bulgarian	bur:	Burmese	byn:	Bilin	cad:	Caddo
cai:	Central American Indian (Other)	car:	Carib	cat:	Catalan	cau:	Caucasian (Other)
ceb:	Cebuano	cel:	Celtic (Other)	cha:	Chamorro	chb:	Chibcha
che:	Chechen	chg:	Chagatai	chi:	Chinese	chk:	Chuukese
chm:	Mari	chn:	Chinook jargon	cho:	Choctaw	chp:	Chipewyan
chr:	Cherokee	chu:	Church Slavic	chv:	Chuvash	chy:	Cheyenne
cmc:	Chamic languages	cop:	Coptic	cor:	Cornish	cos:	Corsican
cpe:	Creoles and Pidgins, English-based (Other)	cpf:	Creoles and Pidgins, French-based (Other)	cpp:	Creoles and Pidgins, Portuguese-based (Other)	cre:	Cree
crh:	Crimean Tatar	crp:	Creoles and Pidgins (Other)	csb:	Kashubian	cus:	Cushitic (Other)
cze:	Czech	dak:	Dakota	dan:	Danish	dar:	Dargwa
day:	Dayak	del:	Delaware	den:	Slavey	dgr:	Dogrib
din:	Dinka	div:	Divehi	doi:	Dogri	dra:	Dravidian (Other)
dsb:	Lower Sorbian	dua:	Duala	dum:	Dutch, Middle (ca. 1050-1350)	dut:	Dutch
dyu:	Dyula	dzo:	Dzongkha	efi:	Efik	egy:	Egyptian
eka:	Ekajuk	elx:	Elamite	eng:	English	enm:	English, Middle (1100-1500)
epo:	Esperanto	est:	Estonian	ewe:	Ewe	ewo:	Ewondo
fan:	Fang	fao:	Faroese	fat:	Fanti	fij:	Fijian
fil:	Filipino	fin:	Finnish	fiu:	Finno-Ugrian (Other)	fon:	Fon
fre:	French	frm:	French, Middle (ca. 1300-1600)	fro:	French, Old (ca. 842-1300)	frr:	North Frisian
frs:	East Frisian	fry:	Frisian	ful:	Fula	fur:	Friulian
gaa:	Gã	gay:	Gayo	gba:	Gbaya	gem:	Germanic (Other)
geo:	Georgian	ger:	German	gez:	Ethiopic	gil:	Gilbertese
gla:	Scottish Gaelic	gle:	Irish	glg:	Galician	glv:	Manx
gmh:	German, Middle High (ca. 1050-1500)	goh:	German, Old High (ca. 750-1050)	gon:	Gondi	gor:	Gorontalo
got:	Gothic	grb:	Grebo	grc:	Greek, Ancient (to 1453)	gre:	Greek, Modern (1453-)
grn:	Guarani	gsw:	Swiss German	guj:	Gujarati	gwi:	Gwich'in
hai:	Haida	hat:	Haitian French Creole	hau:	Hausa	haw:	Hawaiian
heb:	Hebrew	her:	Herero	hil:	Hiligaynon	him:	Western Pahari languages
hin:	Hindi	hit:	Hittite	hmn:	Hmong	hmo:	Hiri Motu
hrv:	Croatian	hsb:	Upper Sorbian	hun:	Hungarian	hup:	Hupa
iba:	Iban	ibo:	Igbo	ice:	Icelandic	ido:	Ido
iii:	Sichuan Yi	ijo:	Ijo	iku:	Inuktitut	ile:	Interlingue
ilo:	Iloko	ina:	Interlingua (International Auxiliary Language Association)	inc:	Indic (Other)	ind:	Indonesian
ine:	Indo-European (Other)	inh:	Ingush	ipk:	Inupiaq	ira:	Iranian (Other)
iro:	Iroquoian (Other)	ita:	Italian	jav:	Javanese	jbo:	Lojban (Artificial language)
jpn:	Japanese	jpr:	Judeo-Persian	jrb:	Judeo-Arabic	kaa:	Kara-Kalpak
kab:	Kabyle	kac:	Kachin	kal:	Kalâtdlisut	kam:	Kamba
kan:	Kannada	kar:	Karen languages	kas:	Kashmiri	kau:	Kanuri
kaw:	Kawi	kaz:	Kazakh	kbd:	Kabardian	kha:	Khasi
khi:	Khoisan (Other)	khm:	Khmer	kho:	Khotanese	kik:	Kikuyu
kin:	Kinyarwanda	kir:	Kyrgyz	kmb:	Kimbundu	kok:	Konkani
kom:	Komi	kon:	Kongo	kor:	Korean	kos:	Kosraean
kpe:	Kpelle	krc:	Karachay-Balkar	krl:	Karelian	kro:	Kru (Other)
kru:	Kurukh	kua:	Kuanyama	kum:	Kumyk	kur:	Kurdish
kut:	Kootenai	lad:	Ladino	lah:	Lahndā	lam:	Lamba (Zambia and Congo)
lao:	Lao	lat:	Latin	lav:	Latvian	lez:	Lezgian
lim:	Limburgish	lin:	Lingala	lit:	Lithuanian	lol:	Mongo-Nkundu
loz:	Lozi	ltz:	Luxembourgish	lua:	Luba-Lulua	lub:	Luba-Katanga
lug:	Ganda	lui:	Luiseño	lun:	Lunda	luo:	Luo (Kenya and Tanzania)
lus:	Lushai	mac:	Macedonian	mad:	Madurese	mag:	Magahi
mah:	Marshallese	mai:	Maithili	mak:	Makasar	mal:	Malayalam
man:	Mandingo	mao:	Maori	map:	Austronesian (Other)	mar:	Marathi
mas:	Maasai	may:	Malay	mdf:	Moksha	mdr:	Mandar
men:	Mende	mga:	Irish, Middle (ca. 1100-1550)	mic:	Micmac	min:	Minangkabau
mis:	Miscellaneous languages	mkh:	Mon-Khmer (Other)	mlg:	Malagasy	mlt:	Maltese
mnc:	Manchu	mni:	Manipuri	mno:	Manobo languages	moh:	Mohawk
mon:	Mongolian	mos:	Mooré	mul:	Multiple languages	mun:	Munda (Other)
mus:	Creek	mwl:	Mirandese	mwr:	Marwari	myn:	Mayan languages
myv:	Erzya	nah:	Nahuatl	nai:	North American Indian (Other)	nap:	Neapolitan Italian
nau:	Nauru	nav:	Navajo	nbl:	Ndebele (South Africa)	nde:	Ndebele (Zimbabwe)
ndo:	Ndonga	nds:	Low German	nep:	Nepali	new:	Newari
nia:	Nias	nic:	Niger-Kordofanian (Other)	niu:	Niuean	nno:	Norwegian (Nynorsk)
nob:	Norwegian (Bokmål)	nog:	Nogai	non:	Old Norse	nor:	Norwegian
nqo:	N'Ko	nso:	Northern Sotho	nub:	Nubian languages	nwc:	Newari, Old
nya:	Nyanja	nym:	Nyamwezi	nyn:	Nyankole	nyo:	Nyoro
nzi:	Nzima	oci:	Occitan (post-1500)	oji:	Ojibwa	ori:	Oriya
orm:	Oromo	osa:	Osage	oss:	Ossetic	ota:	Turkish, Ottoman
oto:	Otomian languages	paa:	Papuan (Other)	pag:	Pangasinan	pal:	Pahlavi
pam:	Pampanga	pan:	Panjabi	pap:	Papiamento	pau:	Palauan
peo:	Old Persian (ca. 600-400 B.C.)	per:	Persian	phi:	Philippine (Other)	phn:	Phoenician
pli:	Pali	pol:	Polish	pon:	Pohnpeian	por:	Portuguese
pra:	Prakrit languages	pro:	Provençal (to 1500)	pus:	Pushto	que:	Quechua
raj:	Rajasthani	rap:	Rapanui	rar:	Rarotongan	roa:	Romance (Other)
roh:	Raeto-Romance	rom:	Romani	rum:	Romanian	run:	Rundi
rup:	Aromanian	rus:	Russian	sad:	Sandawe	sag:	Sango (Ubangi Creole)
sah:	Yakut	sai:	South American Indian (Other)	sal:	Salishan languages	sam:	Samaritan Aramaic
san:	Sanskrit	sas:	Sasak	sat:	Santali	scn:	Sicilian Italian
sco:	Scots	sel:	Selkup	sem:	Semitic (Other)	sga:	Irish, Old (to 1100)
sgn:	Sign languages	shn:	Shan	sid:	Sidamo	sin:	Sinhalese
sio:	Siouan (Other)	sit:	Sino-Tibetan (Other)	sla:	Slavic (Other)	slo:	Slovak
slv:	Slovenian	sma:	Southern Sami	sme:	Northern Sami	smi:	Sami
smj:	Lule Sami	smn:	Inari Sami	smo:	Samoan	sms:	Skolt Sami
sna:	Shona	snd:	Sindhi	snk:	Soninke	sog:	Sogdian
som:	Somali	son:	Songhai	sot:	Sotho	spa:	Spanish
srd:	Sardinian	srn:	Sranan	srp:	Serbian	srr:	Serer
ssa:	Nilo-Saharan (Other)	ssw:	Swazi	suk:	Sukuma	sun:	Sundanese
sus:	Susu	sux:	Sumerian	swa:	Swahili	swe:	Swedish
syc:	Syriac	syr:	Syriac, Modern	tah:	Tahitian	tai:	Tai (Other)
tam:	Tamil	tat:	Tatar	tel:	Telugu	tem:	Temne
ter:	Terena	tet:	Tetum	tgk:	Tajik	tgl:	Tagalog
tha:	Thai	tib:	Tibetan	tig:	Tigré	tir:	Tigrinya
tiv:	Tiv	tkl:	Tokelauan	tlh:	Klingon (Artificial language)	tli:	Tlingit
tmh:	Tamashek	tog:	Tonga (Nyasa)	ton:	Tongan	tpi:	Tok Pisin
tsi:	Tsimshian	tsn:	Tswana	tso:	Tsonga	tuk:	Turkmen
tum:	Tumbuka	tup:	Tupi languages	tur:	Turkish	tut:	Altaic (Other)
tvl:	Tuvaluan	twi:	Twi	tyv:	Tuvinian	udm:	Udmurt
uga:	Ugaritic	uig:	Uighur	ukr:	Ukrainian	umb:	Umbundu
und:	Undetermined	urd:	Urdu	uzb:	Uzbek	vai:	Vai
ven:	Venda	vie:	Vietnamese	vol:	Volapük	vot:	Votic
wak:	Wakashan languages	wal:	Wolayta	war:	Waray	was:	Washoe
wel:	Welsh	wen:	Sorbian (Other)	wln:	Walloon	wol:	Wolof
xal:	Oirat	xho:	Xhosa	yao:	Yao (Africa)	yap:	Yapese
yid:	Yiddish	yor:	Yoruba	ypk:	Yupik languages	zap:	Zapotec
zbl:	Blissymbolics	zen:	Zenaga	zha:	Zhuang	znd:	Zande languages
zul:	Zulu	zun:	Zuni	zxx:	No linguistic content	zza:	Zaza

Deprecated language codes, which may still appear in metadata records, are:

-ajm:	Aljamía	-cam:	Khmer	-esk:	Eskimo languages	-esp:	Esperanto
-eth:	Ethiopic	-far:	Faroese	-fri:	Frisian	-gae:	Scottish Gaelic
-gag:	Galician	-gal:	Oromo	-gua:	Guarani	-int:	Interlingua (International Auxiliary Language Association)
-iri:	Irish	-kus:	Kusaie	-lan:	Occitan (post 1500)	-lap:	Sami
-max:	Manx	-mla:	Malagasy	-mol:	Moldavian	-sao:	Samoan
-scc:	Serbian	-scr:	Croatian	-sho:	Shona	-snh:	Sinhalese
-sso:	Sotho	-swz:	Swazi	-tag:	Tagalog	-taj:	Tajik
-tar:	Tatar	-tru:	Truk	-tsw:	Tswana

The set of codes used for the Copyright field accessRights_t is:

cc-by-3.0:	CC BY 3.0	cc-by-4.0:	CC BY 4.0	cc-by-nc-3.0:	CC BY-NC 3.0
cc-by-nc-4.0:	CC BY-NC 4.0	cc-by-nc-nd-3.0:	CC BY-NC-ND 3.0	cc-by-nc-nd-4.0:	CC BY-NC-ND 4.0
cc-by-nc-sa-3.0:	CC BY-NC-SA 3.0	cc-by-nc-sa-4.0:	CC BY-NC-SA 4.0	cc-by-nd-3.0:	CC BY-ND 3.0
cc-by-nd-4.0:	CC BY-ND 4.0	cc-by-sa-3.0:	CC BY-SA 3.0	cc-by-sa-4.0:	CC BY-SA 4.0
cc-zero:	CC Zero	ic:	In-copyright	ic-world:	In-copyright (world viewable)
icus:	US copyright	nobody:	Blocked	op:	Out-of-print
orph:	Copyright-orphaned	orphcand:	Orphan	pd:	Public domain
pd-pvt:	Access limited	pdus:	Public domain in US only	supp:	Suppressed from view
und:	Undetermined	und-world:	Undetermined

Example Solr JSON Records

An example record returned from Solr is a useful way to see which fields are indexed and, from that, to fashion the query terms you enter into the Workset Builder interface. At the volume level, here is one of the records returned for the query title_t:"USITC publication":

 {
        "lastRightsUpdateDate_t":["20190729"],
        "lastRightsUpdateDate_s":"20190729",
        "schemaVersion_s":"https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0",
        "genre_ss":["http://id.loc.gov/vocabulary/marcgt/doc",
          "http://id.loc.gov/vocabulary/marcgt/gov"],
        "typeOfResource_s":"http://id.loc.gov/ontologies/bibframe/Text",
        "language_ss":["eng"],
        "accessRights_t":["pd"],
        "oclc_t":["4263346"],
        "accessRights_s":"pd",
        "htid_s":"http://hdl.handle.net/2027/mdp.39015084961401",
        "pubPlaceId_ss":["http://id.loc.gov/vocabulary/countries/dcu"],
        "pubDate_t":["197X"],
        "publisherName_t":["U.S. Govt. Print. Off."],
        "htid_t":["http://hdl.handle.net/2027/mdp.39015084961401"],
        "lastRightsUpdateDate_i":20190729,
        "title_s":"USITC publication /",
        "title_t":["USITC publication /"],
        "id":"mdp.39015084961401",
        "mainEntityOfPageCatalogRecord_s":"https://catalog.hathitrust.org/Record/003922761",
        "pubPlaceType_ss":["http://id.loc.gov/ontologies/bibframe/Place"],
        "sourceInstitution_t":["MIU"],
        "contributorName_t":["United States International Trade Commission."],
        "sourceInstitution_s":"MIU",
        "contributorType_ss":["http://id.loc.gov/ontologies/bibframe/Organization"],
        "publisherId_ss":["http://catalogdata.library.illinois.edu/lod/entities/ProvisionActivityAgent/ht/U.S.%20Govt.%20Print.%20Off."],
        "pubPlaceName_ss":["District of Columbia"],
        "pubDate_s":"197X",
        "publisherName_ss":["U.S. Govt. Print. Off."],
        "accessProfile_s":"google",
        "pubPlaceName_t":["District of Columbia"],
        "dateCreated_i":20200209,
        "contributorName_ss":["United States International Trade Commission."],
        "accessProfile_t":["google"],
        "publisherType_ss":["http://id.loc.gov/ontologies/bibframe/Organization"],
        "contributorId_ss":["http://www.viaf.org/viaf/126322615"],
        "language_t":["eng"],
        "oclc_ss":["4263346"],
        "dateCreated_t":["20200209"],
        "dateCreated_s":"20200209",
        "bibliographicFormat_t":["PublicationVolume"],
        "bibliographicFormat_s":"PublicationVolume",
        "_version_":1676455128102600705
}
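
One way to get an overview of the indexed fields in a record like this is to group the field names by their trailing suffix. The following is a minimal sketch that works on a shortened excerpt of the record above. The function name fields_by_suffix is hypothetical, and the code only groups names; it makes no claim about how Solr treats each suffix.

```python
import json
from collections import defaultdict

# Shortened excerpt of the volume-level record shown above.
record_json = """
{
  "title_s": "USITC publication /",
  "title_t": ["USITC publication /"],
  "language_ss": ["eng"],
  "language_t": ["eng"],
  "accessRights_s": "pd",
  "accessRights_t": ["pd"],
  "dateCreated_i": 20200209,
  "id": "mdp.39015084961401"
}
"""


def fields_by_suffix(record):
    """Group field base names by the suffix after the last underscore."""
    groups = defaultdict(list)
    for name in record:
        base, sep, suffix = name.rpartition("_")
        if sep and base and suffix:
            groups[suffix].append(base)
        else:
            # Names without a base/suffix pair (e.g. "id") are kept whole.
            groups[""].append(name)
    return groups


groups = fields_by_suffix(json.loads(record_json))
for suffix in sorted(groups):
    print(repr(suffix), sorted(groups[suffix]))
```

The names listed under the "t" suffix are the ones that match the form used in the example query (title_t), so they are a reasonable starting point when choosing fields for a volume-level metadata search.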

For the equivalent page-level search, which returns all page-based Solr records whose volume title is "USITC publication", the query would be: volumetitle_txt:"USITC publication". Note the prefix 'volume' added to the field name: this helps separate volume-only metadata searching from page-level searching that is combined with volume metadata. Note also the change of suffix from "_t" (used when searching only volume-level metadata) to "_txt" when searching at the page level: the latter suffix (_txt) is very similar to the former (_t), except that the field's value is not stored in the Solr index. There is no need to store it with the page-level record, as it can be retrieved when needed from the volume metadata record.
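
The renaming rule described here (add the 'volume' prefix, swap "_t" for "_txt") can be sketched as a small helper. The function name to_page_level_field is hypothetical, and the sketch covers only the "_t" fields discussed in this section; other suffixes are rejected rather than guessed at.

```python
def to_page_level_field(volume_field):
    """Convert a volume-level '_t' field name to its page-level search name.

    Hypothetical helper: applies only the prefix/suffix rule described above.
    """
    if not volume_field.endswith("_t"):
        raise ValueError("only '_t' fields are handled: " + volume_field)
    base = volume_field[: -len("_t")]
    return "volume" + base + "_txt"


# Rebuild the page-level query from the volume-level field name.
query = '{}:"{}"'.format(to_page_level_field("title_t"), "USITC publication")
print(query)
```

Run as written, this prints the page-level query volumetitle_txt:"USITC publication".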

An example record returned by this query is as follows:

{
        "volumeid_s":"ien.35556029988656",
        "id":"ien.35556029988656.page-000038",
        "volumedateCreated_i":20200209,
        "volumelastRightsUpdateDate_i":20170115,
        "volumebibliographicFormat_htrcstring":"PublicationVolume",
        "volumepubDate_htrcstring":"197X",
        "volumecontributorId_htrcstrings":["http://www.viaf.org/viaf/158040275"],
        "volumepublisherName_htrcstrings":["U.S. Govt."],
        "volumelanguage_htrcstrings":["eng"],
        "volumetypeOfResource_htrcstring":"http://id.loc.gov/ontologies/bibframe/Text",
        "volumepubPlaceType_htrcstrings":["http://id.loc.gov/ontologies/bibframe/Place"],
        "volumecontributorName_htrcstrings":["United States. National Transportation Safety Board."],
        "volumeoclc_htrcstrings":["6371936"],
        "volumegenre_htrcstrings":["http://id.loc.gov/vocabulary/marcgt/doc",
          "http://id.loc.gov/vocabulary/marcgt/gov"],
        "volumecontributorType_htrcstrings":["http://id.loc.gov/ontologies/bibframe/Jurisdiction"],
        "volumedateCreated_htrcstring":"20200209",
        "volumemainEntityOfPageCatalogRecord_htrcstring":"https://catalog.hathitrust.org/Record/002137135431181-4",
        "volumetitle_htrcstring":"Aircraft accident report /",
        "volumeschemaVersion_htrcstring":"https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0",
        "volumepublisherId_htrcstrings":["http://catalogdata.library.illinois.edu/lod/entities/ProvisionActivityAgent/ht/U.S.%20Govt."],
        "volumelastRightsUpdateDate_htrcstring":"20170115",
        "volumepubPlaceName_htrcstrings":["District of Columbia"],
        "volumepubPlaceId_htrcstrings":["http://id.loc.gov/vocabulary/countries/dcu"],
        "volumesourceInstitution_htrcstring":"NWU",
        "volumepublisherType_htrcstrings":["http://id.loc.gov/ontologies/bibframe/Organization"],
        "volumeaccessProfile_htrcstring":"google",
        "_version_":1676440719503392768,
        "volumehtid_htrcstring":"http://hdl.handle.net/2027/ien.35556029988656",
        "volumeaccessRights_htrcstring":"pd"
}
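
In the page-level record above, the id field appears to be the volume identifier followed by ".page-" and a zero-padded page sequence number. Assuming that pattern holds generally (it is inferred from this single example, not from a documented specification), a page id can be split back into its parts as sketched below. The function name split_page_id is hypothetical.

```python
def split_page_id(page_id):
    """Split a page-level record id into (volume id, page sequence number).

    Assumes the '<volume id>.page-<digits>' pattern seen in the example record.
    """
    # Split on the last ".page-" so dots inside the volume id are preserved.
    volume_id, sep, seq = page_id.rpartition(".page-")
    if not sep or not volume_id or not seq.isdigit():
        raise ValueError("unexpected page id format: " + page_id)
    return volume_id, int(seq)


print(split_page_id("ien.35556029988656.page-000038"))
```

A helper of this kind could be used to group downloaded page-level results by volume, mirroring what the "Sort & Group by Volume" option does in the interface.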

Documentation

HTRC Workset Builder 2.0 (Beta) for Extracted Features 2.0