HTRC Workset Builder 2.0 (Beta) for Extracted Features 2.0
HTRC's Workset Builder tool offers advanced search and result filtering functionality to facilitate the creation of HTRC worksets. Learn about the tool and how to use it here.
The HTRC Workset Builder 2.0 Beta for Extracted Features 2.0 is the next iteration of a new interface to the HTRC Extracted Features Dataset. It enables both volume-level metadata search and volume- and page-level unigram (single word) text search of the extracted features in order to build worksets.
As indicated, this interface is currently in beta, and may change.
Quick Guide: How to Build a Workset
To build a workset using HTRC Workset Builder, follow these general steps:
Step 0: Decide what to search
Step 1: Perform a unigram (single-term) full text or metadata (using field tags) search at the desired level (page or volume)
Step 2: Filter through the results and select desired items to add to your workset
Step 3: Repeat steps 1 and 2 as necessary until your workset is ready
Step 4: "Export as workset" to HTRC Analytics or download workset metadata
More information about each of the steps in the workset building process is included in the sections below.
Searching
Searches are not case sensitive, and by default your search is conducted on pages recognized as English. Click “Search all Languages” if you prefer to search everything. You can also limit your search to specific languages by choosing from those that appear under “Show other languages.” To limit your search to a specific part of speech, use the checkboxes under the language, though be aware that not all languages support part-of-speech search. Wildcard matching is possible using '?' for a single character and '*' for multiple characters, for example 'canad?' and '*land'.
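As a rough local illustration of the wildcard semantics described above, the following Python sketch uses the standard-library fnmatch module, which happens to share the '?' and '*' conventions. It is not how Workset Builder or Solr performs matching; the helper name matches and the sample terms are made up for the example. Note that in this sketch '*' also matches an empty run of characters, and fnmatch additionally gives special meaning to square brackets.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, term: str) -> bool:
    """Case-insensitive match: '?' is exactly one character, '*' is any run of characters."""
    return fnmatchcase(term.lower(), pattern.lower())


# Hypothetical sample terms, chosen only to show which ones each pattern accepts.
terms = ["Canada", "canadian", "England", "land", "landing"]

print([t for t in terms if matches("canad?", t)])  # ['Canada']
print([t for t in terms if matches("*land", t)])   # ['England', 'land']
```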
There are four options for searching: text, metadata, combined and advanced. Text search will search the full text of the volume, at the page level, for a unigram or unigrams (e.g. searching all volumes for the word "rose"). Results returned are volume-level metadata, along with page-level metadata and bag-of-words tokens. Since this is a page-based search, you will receive one result for each page that matches your query. To see results grouped by volume (multiple page results under one volume heading and one result), check the box marked "Sort & Group by Volume" under the search bar. Metadata search will search volume-level metadata fields for given unigrams, and return volumes in which the terms appear in a given (or any) field specified in the drop-down menu (e.g. searching all volumes for those with a publication place of "bl", the MARC code for Brazil). A Combined search allows both text and metadata search in a given query (e.g. a search to return all volumes published in Brazil in which the term "rose" appeared on a page). Advanced search allows users familiar with Solr syntax (see below for more information) to construct and execute their own queries.
When searching the page text it is important to realize that every word you enter is treated as a separate term (a unigram) for the purposes of the query that is performed. Effectively, phrase searching of the page text is not possible. This is because Workset Builder is a search interface built on the Extracted Features Dataset, where the sequential order of the words has been removed, effectively making each page a bag of words. The closest approximation is to use the AND operator: for example, the query lawn AND tennis will return all pages where both words appear somewhere on the page. A hyphenated word is processed as a single term, and so does behave like a phrase for indexing purposes: for example, the query "lawn tennis" (in quotes) will find pages where that term appears hyphenated. In the case of volume metadata search, the sequential order of the words is kept, which means phrase searching is possible across metadata.
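To make the bag-of-words point concrete, here is a minimal Python sketch. It assumes plain whitespace tokenization, which is far simpler than the real Extracted Features processing, and the helper names bag_of_words and and_query are illustrative only. It shows that two pages with the same words in a different order are indistinguishable once order is discarded, and how the AND approximation of a phrase can be assembled.

```python
from collections import Counter


def bag_of_words(page_text: str) -> Counter:
    """Reduce a page to lowercase token counts, discarding word order (whitespace split is an assumption)."""
    return Counter(page_text.lower().split())


def and_query(*terms: str) -> str:
    """Join unigrams with the upper-case AND operator."""
    return " AND ".join(terms)


# Same words, different order: the two bags are identical, so a phrase cannot be told apart.
print(bag_of_words("lawn tennis club") == bag_of_words("tennis club lawn"))  # True
print(and_query("lawn", "tennis"))  # lawn AND tennis
```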
Search Text
Text search allows users to query the full text, by page, for unigrams (single terms). By default, text searches will search English-language volumes. If you'd like to search all languages, check the "Search pages in all languages" button underneath the search bar. Currently, part-of-speech information is only available for volumes in English, German, Portuguese, Danish, Dutch and Swedish. While other languages are coded in volume metadata and thus can be retrieved, there will not be part-of-speech data available for those volumes.
Text searches will retrieve volume-level metadata, but the main unit of search and retrieval is the page. Since many pages in a single volume may contain a given unigram, users may wish to check the "Sort & Group by Volume" button directly beneath the search bar, which will present results by the volume, with a list of pages on which the term appears, as compared to multiple volume entries with a single associated page in the results view.
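The difference between the two result views can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not the tool's implementation; the volume identifiers and page sequence numbers below are made up, and group_by_volume is a hypothetical helper.

```python
from collections import defaultdict


def group_by_volume(page_hits):
    """Turn (volume_id, page_seq) page-level hits into {volume_id: sorted page list}."""
    grouped = defaultdict(list)
    for volume_id, page_seq in page_hits:
        grouped[volume_id].append(page_seq)
    return {volume_id: sorted(pages) for volume_id, pages in grouped.items()}


# Hypothetical page-level results: one entry per matching page.
hits = [("vol.A", 12), ("vol.B", 3), ("vol.A", 7)]
print(group_by_volume(hits))  # {'vol.A': [7, 12], 'vol.B': [3]}
```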
Search Metadata
Metadata search allows users to query the catalogue metadata associated with each volume in the corpus (aka volume metadata). Enter a single term, multiple terms, phrases (in quotes), or any combination thereof. By default, "All fields" is selected in the drop-down menu next to the search box. Click on the drop-down menu to select more specific fields to search by, such as Title.
To search multiple metadata fields, enter your search query in a format called Solr syntax. For example, a search for title_t:hamlet AND contributorName_t:shakespeare will return all volumes with “hamlet” in the title field and “shakespeare” in the contributorName field, the latter being the field used by the cataloguer to record a personal or corporate name associated with the volume. The same search can be used with an “OR” operator to return volumes that satisfy either condition. Note that there is no space between the colon and the search term. For more information, see this Solr query syntax guide.
For information on the volume metadata fields, including possible values for fields with controlled vocabularies, see the Metadata field values section below.
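If you assemble such queries programmatically before pasting them into the search box, a small helper can keep the formatting rules straight (no space after the colon, upper-case operators). The following Python sketch is one possible way to do it; field_query and combine are hypothetical helper names, not part of any HTRC library.

```python
def field_query(field: str, term: str) -> str:
    """Build a field:term clause with no space after the colon."""
    return f"{field}:{term}"


def combine(operator: str, *clauses: str) -> str:
    """Join clauses with an upper-case boolean operator."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be the upper-case string 'AND' or 'OR'")
    return f" {operator} ".join(clauses)


title = field_query("title_t", "hamlet")
name = field_query("contributorName_t", "shakespeare")
print(combine("AND", title, name))  # title_t:hamlet AND contributorName_t:shakespeare
print(combine("OR", title, name))   # title_t:hamlet OR contributorName_t:shakespeare
```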
Searching dates
Metadata date fields can be searched using (primarily) 4-digit years, e.g. "1948". A variety of dates can be associated with a volume, such as its publication date (pubDate_i) and the date the digital form of the record was created (dateCreated_i). Being numeric, these values are indexed in "_i" fields, where the suffix indicates the indexed value is an integer (as opposed to "_t" for text).
To go beyond searching for a document from a particular year (e.g. pubDate_i:1948) and search over a date range, you need to use more complex Solr syntax (more information in the next section) to specify the numeric start and stop values. After the field to search by is specified, this takes the form [<val> TO <val>]. For example, to search for volumes published 1880-1883, enter “pubDate_i:[1880 TO 1883]” in the Advanced search tab. Solr query syntax is case-sensitive, so the value range must be expressed with "TO" (and not, for example, "to").
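The range form can be generated with a short helper, sketched here in Python. The function name year_range_query is an assumption made for illustration; the output string is meant to be pasted into the Advanced search tab.

```python
def year_range_query(field: str, start: int, end: int) -> str:
    """Build an inclusive range clause such as pubDate_i:[1880 TO 1883]."""
    if start > end:
        raise ValueError("start year must not be later than end year")
    # 'TO' must be upper case because Solr query keywords are case-sensitive.
    return f"{field}:[{start} TO {end}]"


print(year_range_query("pubDate_i", 1880, 1883))  # pubDate_i:[1880 TO 1883]
```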
Most recorded dates are numeric. However, sometimes the exact date was not known at the time of cataloguing, and an "X", "-" or "?" is substituted for a digit to signify this, such as "194X", "194-" or "194?". For this reason, "_t" versions of date-related fields are also provided, allowing more thorough date searches to be expressed, albeit with more complex syntax. Note that all dates are mapped to the "_t" version of the field, including the fully numeric ones.
To continue with the previous example of searching for volumes published 1880-1883, a more expansive form of the search would be: pubDate_t:188X OR pubDate_t:188\? OR pubDate_t:1880 OR pubDate_t:1881 OR pubDate_t:1882 OR pubDate_t:1883. You could then look through the results and decide case by case whether to keep a 188?, 188- or 188X date in the result set, or press the 'x' icon in the corner to remove it. If the search criterion is publication in a given decade, including the '?', '-' and 'X' dates, the query syntax becomes simpler: pubDate_t:188? finds them all. Note that this time the "?" is not preceded by a backslash (\), so Solr interprets it as the wildcard character for matching purposes, not as the literal question mark symbol. More about this in the next section.
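Writing out the expansive query by hand is error-prone, so here is a hedged Python sketch that generates it. It assumes 4-digit years that fall within a single decade, and it reproduces only the "X" and escaped "?" uncertain forms used in the example query above. The helper names expansive_year_query and decade_query are hypothetical.

```python
def expansive_year_query(field: str, start: int, end: int) -> str:
    """OR together the uncertain-decade forms and each exact year from start to end."""
    if not (1000 <= start <= end <= 9999):
        raise ValueError("expected 4-digit years with start <= end")
    decade = str(start)[:3]
    if str(end)[:3] != decade:
        raise ValueError("this sketch only handles years within one decade")
    # The backslash makes Solr treat '?' as a literal character rather than a wildcard.
    uncertain = [f"{decade}X", f"{decade}\\?"]
    exact = [str(year) for year in range(start, end + 1)]
    return " OR ".join(f"{field}:{value}" for value in uncertain + exact)


def decade_query(field: str, decade_prefix: str) -> str:
    """Match any final character in the decade using the unescaped '?' wildcard."""
    return f"{field}:{decade_prefix}?"


print(expansive_year_query("pubDate_t", 1880, 1883))
# pubDate_t:188X OR pubDate_t:188\? OR pubDate_t:1880 OR pubDate_t:1881 OR pubDate_t:1882 OR pubDate_t:1883
print(decade_query("pubDate_t", "188"))  # pubDate_t:188?
```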
Volume Metadata
While Solr query syntax terms (introduced in the earlier sections) such as "AND", "OR", and "TO" are case-sensitive, meaning they must be entered exactly as shown, searching volume-level metadata is case-insensitive: searching for Shakespeare with the term "SHAKESPEARE", "shakespeare" or even "ShakespEare" returns the same results. Solr query syntax supports wildcard matching using '?' for a single character and '*' for multiple characters, for example 'canad?' and '*land'. While the HTRC Workset Builder interface is designed for unigram (single-term) searching of page text, volume-level metadata does support phrase-based search. This is done using quotation marks: "term1 term2".
Via the "Metadata" tab you can enter terms directly into the search box, in which case the terms are matched against a common set of volume metadata fields. The main ones are: Title, Name, Genre, Type of Resource, Place of Publication. You can click on the "Show full query ..." button on the search results page to see the complete list of fields.
You can also search specific volume metadata fields using the field:term form, such as contributorName_t:Austen, or as a phrase-based search, contributorName_t:"Austen, Jane".
In alphabetical order, searchable fields are as follows. In moving from Extracted Features 1.5 to 2.0, some of the indexed metadata fields have been changed. In the table below, in cases where this has occurred, the old (v1.5) name is displayed with strikethrough to explicitly mark the transition.
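The choice between a bare term and a quoted phrase can be automated. The Python sketch below quotes a value only when it contains whitespace. It is an illustration under stated assumptions: metadata_clause is a hypothetical helper, and values that already contain a double quote are rejected rather than escaped, to keep the example small.

```python
def metadata_clause(field: str, value: str) -> str:
    """Build field:term, wrapping the value in double quotes when it is a multi-word phrase."""
    if '"' in value:
        raise ValueError("values containing double quotes are not handled in this sketch")
    if any(ch.isspace() for ch in value):
        return f'{field}:"{value}"'
    return f"{field}:{value}"


print(metadata_clause("contributorName_t", "Austen"))        # contributorName_t:Austen
print(metadata_clause("contributorName_t", "Austen, Jane"))  # contributorName_t:"Austen, Jane"
```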
Field name | Field name in Solr syntax | Field description |
Access Profile | accessProfile_t: | The code that indicates full-text access level. |
Bibliographic Format | bibliographicFormat_t: | The code for the format of a volume (e.g. book, serial, etc.). |
Date Created | dateCreated_i:, dateCreated_t: | The time this metadata object was processed. |
Genre | genre_t: | The genre of the volume. |
Handle URL | htid_s: | The persistent identifier for the given volume. |
HathiTrust Record Number | hathitrustRecordNumber_t: | The unique record number for the volume in the HathiTrust Digital Library. |
HathiTrust Bib URL | mainEntityOfPageCatalogRecord_s: | The HathiTrust Bibliographic RESTful call for the volume metadata record. The provided URL delivers an HTML page. Adding ".json" to the end of the URL delivers the metadata in JSON format. |
Imprint | publisherName_t: | The place of publication, publisher, and publication date of the given volume. |
Language | language_t: | The primary language of the volume in MARC language code format. |
Last Update Date | lastRightsUpdateDate_i:, | The date this page was last updated. |
Names | contributorName_t: | The personal and corporate names associated with a volume. |
OCLC | oclc_t: | The control number(s) assigned to each bibliographic record by the Online Computer Library Center (OCLC). |
Publication Date | pubDate_i:, pubDate_t: | The publication year. |
Publication Place | pubPlaceName_t: | The publication location code in MARC country code format. |
Rights Attribute (Access Rights) | accessRights_t: | The rights attributes for a volume. |
Schema Version | schemaVersion_s: | A version identifier for the format and structure of this metadata object. |
Source Institution | sourceInstitution_t: | The institution code of the original institution that contributed the volume. |
Title | title_t: | Title of the volume. |
Type of Resource | typeOfResource_s: | The format type of a volume. |
Volume Identifier | volumeIdentifier_t: | A unique identifier for the current volume. This is the same identifier used in the HathiTrust and HathiTrust Research Center corpora. |
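The table notes that appending ".json" to the HathiTrust Bib URL returns the record as JSON instead of HTML. A minimal Python sketch of that string manipulation follows. The URL used is a placeholder on example.org, not a real HathiTrust address, and json_record_url is a hypothetical helper that performs no network access.

```python
def json_record_url(bib_url: str) -> str:
    """Return the JSON variant of a bibliographic record URL by appending '.json' once."""
    if bib_url.endswith(".json"):
        return bib_url
    return bib_url + ".json"


# Placeholder URL for illustration only.
print(json_record_url("https://example.org/records/000000000"))
# https://example.org/records/000000000.json
```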
Results
On the results page, you will find the title and unique HathiTrust ID for each volume that contains a result based on your search. You can hover over the title of each volume to trigger a pop-up with brief metadata information. If a text search is part of your query, the page sequence on which your search term appears is also listed. Lastly, a link to download the Extracted Features data for each volume in your results is generated. If you follow the link to the page sequence for your search term, the Extracted Features data for that page (tokens in a variety of views, parts of speech, and token frequencies) is shown, along with links to download the Extracted Features files for the page or volume and a thumbnail of the page image with a link back to HathiTrust to view the page directly. Additionally, below the volume title, the full metadata record is available in human-readable form if you click the "Show metadata" link to expand the section.
Filtering
On your results page, you will see seven different fields that can be used to filter search results. These fields are derived from the same metadata fields listed above: genre, language, copyright status, author, place of publication, original bibliographic format and classification. To apply facets to filter results, check boxes next to the desired facets under a given heading (e.g. "author"), and then click the "Apply Filter" button that will appear next to the section heading. To filter by values in more than one field/heading, you must first choose to apply filters in one field before doing so in another.
Exporting results
Once you have a desired set of results on the search page, you may work with or save them in a number of ways. For result sets of fewer than 40 million pages, you may export the entire set of search results as a list of volume or page IDs or as a metadata manifest with one row per volume, or you may choose to download the Extracted Features files for each volume in your result set. When downloading Extracted Features files for a result set containing many volumes, be mindful that this will be a large download, which can take minutes to complete.
To create a workset from more than one search, you can add volumes you'd like to include to your shopping cart by checking boxes next to each volume and pressing the yellow "Add" button or by selecting volumes via check box and dragging and dropping them into the shopping cart icon at the top right on the result page. If you'd like to change the checkboxes for each item, you can use the "Select All On This Page" or "Deselect All" to either check or uncheck all results. Similarly, the "Invert Selection" button can be used to change all checked items to unchecked, or the inverse.
Once your shopping cart is complete with the volumes you're interested in, click on the cart icon to view your workset. From this page, you can directly import your shopping cart as a workset in HTRC Analytics by clicking the "Export as Workset" button at the top right of the shopping cart page. From here, you'll be taken to HTRC Analytics, prompted to sign in, if you aren't already, and asked to provide a name and description of your new workset. Once a workset is imported into Analytics, you can get metadata information, share the workset, and run algorithms over its contents.
Saving worksets
Since worksets created using the Workset Builder are tied to a web browser session, once you exit your browser, your workset will not be saved unless you export it in one of the above ways. For the same reason, worksets cannot be shared via URL unless they are imported into HTRC Analytics.
Metadata field values
The Bibliographic Format field is coded: for example, BK represents a book. The possible codes used for this field are:
BK: | Books | CF: | Computer Files | CR: | Continuing Resources | MP: | Maps |
MU: | Music | MX: | Mixed Materials | SE: | Serials | VM: | Visual Materials |
The Place of Publication field pubPlaceName_t uses MARC country codes, with the following possible values:
aa: | Albania | abc: | Alberta | aca: | Australian Capital Territory | ae: | Algeria |
af: | Afghanistan | ag: | Argentina | ai: | Armenia (Republic) | aj: | Azerbaijan |
aku: | Alaska | alu: | Alabama | am: | Anguilla | an: | Andorra |
ao: | Angola | aq: | Antigua and Barbuda | aru: | Arkansas | as: | American Samoa |
at: | Australia | au: | Austria | aw: | Aruba | ay: | Antarctica |
azu: | Arizona | ba: | Bahrain | bb: | Barbados | bcc: | British Columbia |
bd: | Burundi | be: | Belgium | bf: | Bahamas | bg: | Bangladesh |
bh: | Belize | bi: | British Indian Ocean Territory | bl: | Brazil | bm: | Bermuda Islands |
bn: | Bosnia and Herzegovina | bo: | Bolivia | bp: | Solomon Islands | br: | Burma |
bs: | Botswana | bt: | Bhutan | bu: | Bulgaria | bv: | Bouvet Island |
bw: | Belarus | bx: | Brunei | ca: | Caribbean Netherlands | cau: | California |
cb: | Cambodia | cc: | China | cd: | Chad | ce: | Sri Lanka |
cf: | Congo (Brazzaville) | cg: | Congo (Democratic Republic) | ch: | China (Republic : 1949- ) | ci: | Croatia |
cj: | Cayman Islands | ck: | Colombia | cl: | Chile | cm: | Cameroon |
co: | Curaçao | cou: | Colorado | cq: | Comoros | cr: | Costa Rica |
ctu: | Connecticut | cu: | Cuba | cv: | Cabo Verde | cw: | Cook Islands |
cx: | Central African Republic | cy: | Cyprus | dcu: | District of Columbia | deu: | Delaware |
dk: | Denmark | dm: | Benin | dq: | Dominica | dr: | Dominican Republic |
ea: | Eritrea | ec: | Ecuador | eg: | Equatorial Guinea | em: | Timor-Leste |
enk: | England | er: | Estonia | es: | El Salvador | et: | Ethiopia |
fa: | Faroe Islands | fg: | French Guiana | fi: | Finland | fj: | Fiji |
fk: | Falkland Islands | flu: | Florida | fm: | Micronesia (Federated States) | fp: | French Polynesia |
fr: | France | fs: | Terres australes et antarctiques françaises | ft: | Djibouti | gau: | Georgia |
gb: | Kiribati | gd: | Grenada | gh: | Ghana | gi: | Gibraltar |
gl: | Greenland | gm: | Gambia | go: | Gabon | gp: | Guadeloupe |
gr: | Greece | gs: | Georgia (Republic) | gt: | Guatemala | gu: | Guam |
gv: | Guinea | gw: | Germany | gy: | Guyana | gz: | Gaza Strip |
hiu: | Hawaii | hm: | Heard and McDonald Islands | ho: | Honduras | ht: | Haiti |
hu: | Hungary | iau: | Iowa | ic: | Iceland | idu: | Idaho |
ie: | Ireland | ii: | India | ilu: | Illinois | inu: | Indiana |
io: | Indonesia | iq: | Iraq | ir: | Iran | is: | Israel |
it: | Italy | iv: | Côte d'Ivoire | iy: | Iraq-Saudi Arabia Neutral Zone | ja: | Japan |
ji: | Johnston Atoll | jm: | Jamaica | jo: | Jordan | ke: | Kenya |
kg: | Kyrgyzstan | kn: | Korea (North) | ko: | Korea (South) | ksu: | Kansas |
ku: | Kuwait | kv: | Kosovo | kyu: | Kentucky | kz: | Kazakhstan |
lau: | Louisiana | lb: | Liberia | le: | Lebanon | lh: | Liechtenstein |
li: | Lithuania | lo: | Lesotho | ls: | Laos | lu: | Luxembourg |
lv: | Latvia | ly: | Libya | mau: | Massachusetts | mbc: | Manitoba |
mc: | Monaco | mdu: | Maryland | meu: | Maine | mf: | Mauritius |
mg: | Madagascar | miu: | Michigan | mj: | Montserrat | mk: | Oman |
ml: | Mali | mm: | Malta | mnu: | Minnesota | mo: | Montenegro |
mou: | Missouri | mp: | Mongolia | mq: | Martinique | mr: | Morocco |
msu: | Mississippi | mtu: | Montana | mu: | Mauritania | mv: | Moldova |
mw: | Malawi | mx: | Mexico | my: | Malaysia | mz: | Mozambique |
nbu: | Nebraska | ncu: | North Carolina | ndu: | North Dakota | ne: | Netherlands |
nfc: | Newfoundland and Labrador | ng: | Niger | nhu: | New Hampshire | nik: | Northern Ireland |
nju: | New Jersey | nkc: | New Brunswick | nl: | New Caledonia | nmu: | New Mexico |
nn: | Vanuatu | no: | Norway | np: | Nepal | nq: | Nicaragua |
nr: | Nigeria | nsc: | Nova Scotia | ntc: | Northwest Territories | nu: | Nauru |
nuc: | Nunavut | nvu: | Nevada | nw: | Northern Mariana Islands | nx: | Norfolk Island |
nyu: | New York (State) | nz: | New Zealand | ohu: | Ohio | oku: | Oklahoma |
onc: | Ontario | oru: | Oregon | ot: | Mayotte | pau: | Pennsylvania |
pc: | Pitcairn Island | pe: | Peru | pf: | Paracel Islands | pg: | Guinea-Bissau |
ph: | Philippines | pic: | Prince Edward Island | pk: | Pakistan | pl: | Poland |
pn: | Panama | po: | Portugal | pp: | Papua New Guinea | pr: | Puerto Rico |
pw: | Palau | py: | Paraguay | qa: | Qatar | qea: | Queensland |
quc: | Québec (Province) | rb: | Serbia | re: | Réunion | rh: | Zimbabwe |
riu: | Rhode Island | rm: | Romania | ru: | Russia (Federation) | rw: | Rwanda |
sa: | South Africa | sc: | Saint-Barthélemy | scu: | South Carolina | sd: | South Sudan |
sdu: | South Dakota | se: | Seychelles | sf: | Sao Tome and Principe | sg: | Senegal |
sh: | Spanish North Africa | si: | Singapore | sj: | Sudan | sl: | Sierra Leone |
sm: | San Marino | sn: | Sint Maarten | snc: | Saskatchewan | so: | Somalia |
sp: | Spain | sq: | Swaziland | sr: | Surinam | ss: | Western Sahara |
st: | Saint-Martin | stk: | Scotland | su: | Saudi Arabia | sw: | Sweden |
sx: | Namibia | sy: | Syria | sz: | Switzerland | ta: | Tajikistan |
tc: | Turks and Caicos Islands | tg: | Togo | th: | Thailand | ti: | Tunisia |
tk: | Turkmenistan | tl: | Tokelau | tma: | Tasmania | tnu: | Tennessee |
to: | Tonga | tr: | Trinidad and Tobago | ts: | United Arab Emirates | tu: | Turkey |
tv: | Tuvalu | txu: | Texas | tz: | Tanzania | ua: | Egypt |
uc: | United States Misc. Caribbean Islands | ug: | Uganda | uik: | United Kingdom Misc. Islands | un: | Ukraine |
up: | United States Misc. Pacific Islands | utu: | Utah | uv: | Burkina Faso | uy: | Uruguay |
uz: | Uzbekistan | vau: | Virginia | vb: | British Virgin Islands | vc: | Vatican City |
ve: | Venezuela | vi: | Virgin Islands of the United States | vm: | Vietnam | vp: | Various places |
vra: | Victoria | vtu: | Vermont | wau: | Washington (State) | wea: | Western Australia |
wf: | Wallis and Futuna | wiu: | Wisconsin | wj: | West Bank of the Jordan River | wk: | Wake Island |
wlk: | Wales | ws: | Samoa | wvu: | West Virginia | wyu: | Wyoming |
xa: | Christmas Island (Indian Ocean) | xb: | Cocos (Keeling) Islands | xc: | Maldives | xd: | Saint Kitts-Nevis |
xe: | Marshall Islands | xf: | Midway Islands | xga: | Coral Sea Islands Territory | xh: | Niue |
xj: | Saint Helena | xk: | Saint Lucia | xl: | Saint Pierre and Miquelon | xm: | Saint Vincent and the Grenadines |
xn: | Macedonia | xna: | New South Wales | xo: | Slovakia | xoa: | Northern Territory |
xp: | Spratly Island | xr: | Czech Republic | xra: | South Australia | xs: | South Georgia and the South Sandwich Islands |
xv: | Slovenia | xx: | No place | xxc: | Canada | xxk: | United Kingdom |
xxu: | United States | ye: | Yemen | ykc: | Yukon Territory | za: | Zambia |
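Because the Place of Publication field holds codes rather than names, it can help to look the code up before building a query. The Python sketch below uses a five-entry excerpt of the table above; a real script would need the full list. The names PLACE_CODES and place_query are assumptions made for illustration.

```python
# Small excerpt of the MARC country code table above; not the complete list.
PLACE_CODES = {
    "bl": "Brazil",
    "enk": "England",
    "fr": "France",
    "ja": "Japan",
    "xxc": "Canada",
}


def place_query(place_name: str) -> str:
    """Find the code for a place name (case-insensitive) and build a pubPlaceName_t clause."""
    wanted = place_name.strip().lower()
    for code, name in PLACE_CODES.items():
        if name.lower() == wanted:
            return f"pubPlaceName_t:{code}"
    raise KeyError(f"no code in this excerpt for {place_name!r}")


print(place_query("Brazil"))   # pubPlaceName_t:bl
print(place_query("england"))  # pubPlaceName_t:enk
```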
Deprecated MARC Place of Publication codes, which are no longer being actively used, but may still appear in data, are:
-ac: | Ashmore and Cartier Islands | -ai: | Anguilla | -air: | Armenian S.S.R. | -ajr: | Azerbaijan S.S.R. |
-bwr: | Byelorussian S.S.R. | -cn: | Canada | -cp: | Canton and Enderbury Islands | -cs: | Czechoslovakia |
-cz: | Canal Zone | -err: | Estonia | -ge: | Germany (East) | -gn: | Gilbert and Ellice Islands |
-gsr: | Georgian S.S.R. | -hk: | Hong Kong | -iu: | Israel-Syria Demilitarized Zones | -iw: | Israel-Jordan Demilitarized Zones |
-jn: | Jan Mayen | -kgr: | Kirghiz S.S.R. | -kzr: | Kazakh S.S.R. | -lir: | Lithuania |
-ln: | Central and Southern Line Islands | -lvr: | Latvia | -mh: | Macao | -mvr: | Moldavian S.S.R. |
-na: | Netherlands Antilles | -nm: | Northern Mariana Islands | -pt: | Portuguese Timor | -rur: | Russian S.F.S.R. |
-ry: | Ryukyu Islands, Southern | -sb: | Svalbard | -sk: | Sikkim | -sv: | Swan Islands |
-tar: | Tajik S.S.R. | -tkr: | Turkmen S.S.R. | -tt: | Trust Territory of the Pacific Islands | -ui: | United Kingdom Misc. Islands |
-uk: | United Kingdom | -unr: | Ukraine | -ur: | Soviet Union | -us: | United States |
-uzr: | Uzbek S.S.R. | -vn: | Vietnam, North | -vs: | Vietnam, South | -wb: | West Berlin |
-xi: | Saint Kitts-Nevis-Anguilla | -xxr: | Soviet Union | -ys: | Yemen (People's Democratic Republic) | -yu: | Serbia and Montenegro |
The set of codes used for the Language field language_t is also derived from MARC language codes, with the following possible values:
aar: | Afar | abk: | Abkhaz | ace: | Achinese | ach: | Acoli |
ada: | Adangme | ady: | Adygei | afa: | Afroasiatic (Other) | afh: | Afrihili (Artificial language) |
afr: | Afrikaans | ain: | Ainu | aka: | Akan | akk: | Akkadian |
alb: | Albanian | ale: | Aleut | alg: | Algonquian (Other) | alt: | Altai |
amh: | Amharic | ang: | English, Old (ca. 450-1100) | anp: | Angika | apa: | Apache languages |
ara: | Arabic | arc: | Aramaic | arg: | Aragonese | arm: | Armenian |
arn: | Mapuche | arp: | Arapaho | art: | Artificial (Other) | arw: | Arawak |
asm: | Assamese | ast: | Bable | ath: | Athapascan (Other) | aus: | Australian languages |
ava: | Avaric | ave: | Avestan | awa: | Awadhi | aym: | Aymara |
aze: | Azerbaijani | bad: | Banda languages | bai: | Bamileke languages | bak: | Bashkir |
bal: | Baluchi | bam: | Bambara | ban: | Balinese | baq: | Basque |
bas: | Basa | bat: | Baltic (Other) | bej: | Beja | bel: | Belarusian |
bem: | Bemba | ben: | Bengali | ber: | Berber (Other) | bho: | Bhojpuri |
bih: | Bihari (Other) | bik: | Bikol | bin: | Edo | bis: | Bislama |
bla: | Siksika | bnt: | Bantu (Other) | bos: | Bosnian | bra: | Braj |
bre: | Breton | btk: | Batak | bua: | Buriat | bug: | Bugis |
bul: | Bulgarian | bur: | Burmese | byn: | Bilin | cad: | Caddo |
cai: | Central American Indian (Other) | car: | Carib | cat: | Catalan | cau: | Caucasian (Other) |
ceb: | Cebuano | cel: | Celtic (Other) | cha: | Chamorro | chb: | Chibcha |
che: | Chechen | chg: | Chagatai | chi: | Chinese | chk: | Chuukese |
chm: | Mari | chn: | Chinook jargon | cho: | Choctaw | chp: | Chipewyan |
chr: | Cherokee | chu: | Church Slavic | chv: | Chuvash | chy: | Cheyenne |
cmc: | Chamic languages | cop: | Coptic | cor: | Cornish | cos: | Corsican |
cpe: | Creoles and Pidgins, English-based (Other) | cpf: | Creoles and Pidgins, French-based (Other) | cpp: | Creoles and Pidgins, Portuguese-based (Other) | cre: | Cree |
crh: | Crimean Tatar | crp: | Creoles and Pidgins (Other) | csb: | Kashubian | cus: | Cushitic (Other) |
cze: | Czech | dak: | Dakota | dan: | Danish | dar: | Dargwa |
day: | Dayak | del: | Delaware | den: | Slavey | dgr: | Dogrib |
din: | Dinka | div: | Divehi | doi: | Dogri | dra: | Dravidian (Other) |
dsb: | Lower Sorbian | dua: | Duala | dum: | Dutch, Middle (ca. 1050-1350) | dut: | Dutch |
dyu: | Dyula | dzo: | Dzongkha | efi: | Efik | egy: | Egyptian |
eka: | Ekajuk | elx: | Elamite | eng: | English | enm: | English, Middle (1100-1500) |
epo: | Esperanto | est: | Estonian | ewe: | Ewe | ewo: | Ewondo |
fan: | Fang | fao: | Faroese | fat: | Fanti | fij: | Fijian |
fil: | Filipino | fin: | Finnish | fiu: | Finno-Ugrian (Other) | fon: | Fon |
fre: | French | frm: | French, Middle (ca. 1300-1600) | fro: | French, Old (ca. 842-1300) | frr: | North Frisian |
frs: | East Frisian | fry: | Frisian | ful: | Fula | fur: | Friulian |
gaa: | Gã | gay: | Gayo | gba: | Gbaya | gem: | Germanic (Other) |
geo: | Georgian | ger: | German | gez: | Ethiopic | gil: | Gilbertese |
gla: | Scottish Gaelic | gle: | Irish | glg: | Galician | glv: | Manx |
gmh: | German, Middle High (ca. 1050-1500) | goh: | German, Old High (ca. 750-1050) | gon: | Gondi | gor: | Gorontalo |
got: | Gothic | grb: | Grebo | grc: | Greek, Ancient (to 1453) | gre: | Greek, Modern (1453-) |
grn: | Guarani | gsw: | Swiss German | guj: | Gujarati | gwi: | Gwich'in |
hai: | Haida | hat: | Haitian French Creole | hau: | Hausa | haw: | Hawaiian |
heb: | Hebrew | her: | Herero | hil: | Hiligaynon | him: | Western Pahari languages |
hin: | Hindi | hit: | Hittite | hmn: | Hmong | hmo: | Hiri Motu |
hrv: | Croatian | hsb: | Upper Sorbian | hun: | Hungarian | hup: | Hupa |
iba: | Iban | ibo: | Igbo | ice: | Icelandic | ido: | Ido |
iii: | Sichuan Yi | ijo: | Ijo | iku: | Inuktitut | ile: | Interlingue |
ilo: | Iloko | ina: | Interlingua (International Auxiliary Language Association) | inc: | Indic (Other) | ind: | Indonesian |
ine: | Indo-European (Other) | inh: | Ingush | ipk: | Inupiaq | ira: | Iranian (Other) |
iro: | Iroquoian (Other) | ita: | Italian | jav: | Javanese | jbo: | Lojban (Artificial language) |
jpn: | Japanese | jpr: | Judeo-Persian | jrb: | Judeo-Arabic | kaa: | Kara-Kalpak |
kab: | Kabyle | kac: | Kachin | kal: | Kalâtdlisut | kam: | Kamba |
kan: | Kannada | kar: | Karen languages | kas: | Kashmiri | kau: | Kanuri |
kaw: | Kawi | kaz: | Kazakh | kbd: | Kabardian | kha: | Khasi |
khi: | Khoisan (Other) | khm: | Khmer | kho: | Khotanese | kik: | Kikuyu |
kin: | Kinyarwanda | kir: | Kyrgyz | kmb: | Kimbundu | kok: | Konkani |
kom: | Komi | kon: | Kongo | kor: | Korean | kos: | Kosraean |
kpe: | Kpelle | krc: | Karachay-Balkar | krl: | Karelian | kro: | Kru (Other) |
kru: | Kurukh | kua: | Kuanyama | kum: | Kumyk | kur: | Kurdish |
kut: | Kootenai | lad: | Ladino | lah: | Lahndā | lam: | Lamba (Zambia and Congo) |
lao: | Lao | lat: | Latin | lav: | Latvian | lez: | Lezgian |
lim: | Limburgish | lin: | Lingala | lit: | Lithuanian | lol: | Mongo-Nkundu |
loz: | Lozi | ltz: | Luxembourgish | lua: | Luba-Lulua | lub: | Luba-Katanga |
lug: | Ganda | lui: | Luiseño | lun: | Lunda | luo: | Luo (Kenya and Tanzania) |
lus: | Lushai | mac: | Macedonian | mad: | Madurese | mag: | Magahi |
mah: | Marshallese | mai: | Maithili | mak: | Makasar | mal: | Malayalam |
man: | Mandingo | mao: | Maori | map: | Austronesian (Other) | mar: | Marathi |
mas: | Maasai | may: | Malay | mdf: | Moksha | mdr: | Mandar |
men: | Mende | mga: | Irish, Middle (ca. 1100-1550) | mic: | Micmac | min: | Minangkabau |
mis: | Miscellaneous languages | mkh: | Mon-Khmer (Other) | mlg: | Malagasy | mlt: | Maltese |
mnc: | Manchu | mni: | Manipuri | mno: | Manobo languages | moh: | Mohawk |
mon: | Mongolian | mos: | Mooré | mul: | Multiple languages | mun: | Munda (Other) |
mus: | Creek | mwl: | Mirandese | mwr: | Marwari | myn: | Mayan languages |
myv: | Erzya | nah: | Nahuatl | nai: | North American Indian (Other) | nap: | Neapolitan Italian |
nau: | Nauru | nav: | Navajo | nbl: | Ndebele (South Africa) | nde: | Ndebele (Zimbabwe) |
ndo: | Ndonga | nds: | Low German | nep: | Nepali | new: | Newari |
nia: | Nias | nic: | Niger-Kordofanian (Other) | niu: | Niuean | nno: | Norwegian (Nynorsk) |
nob: | Norwegian (Bokmål) | nog: | Nogai | non: | Old Norse | nor: | Norwegian |
nqo: | N'Ko | nso: | Northern Sotho | nub: | Nubian languages | nwc: | Newari, Old |
nya: | Nyanja | nym: | Nyamwezi | nyn: | Nyankole | nyo: | Nyoro |
nzi: | Nzima | oci: | Occitan (post-1500) | oji: | Ojibwa | ori: | Oriya |
orm: | Oromo | osa: | Osage | oss: | Ossetic | ota: | Turkish, Ottoman |
oto: | Otomian languages | paa: | Papuan (Other) | pag: | Pangasinan | pal: | Pahlavi |
pam: | Pampanga | pan: | Panjabi | pap: | Papiamento | pau: | Palauan |
peo: | Old Persian (ca. 600-400 B.C.) | per: | Persian | phi: | Philippine (Other) | phn: | Phoenician |
pli: | Pali | pol: | Polish | pon: | Pohnpeian | por: | Portuguese |
pra: | Prakrit languages | pro: | Provençal (to 1500) | pus: | Pushto | que: | Quechua |
raj: | Rajasthani | rap: | Rapanui | rar: | Rarotongan | roa: | Romance (Other) |
roh: | Raeto-Romance | rom: | Romani | rum: | Romanian | run: | Rundi |
rup: | Aromanian | rus: | Russian | sad: | Sandawe | sag: | Sango (Ubangi Creole) |
sah: | Yakut | sai: | South American Indian (Other) | sal: | Salishan languages | sam: | Samaritan Aramaic |
san: | Sanskrit | sas: | Sasak | sat: | Santali | scn: | Sicilian Italian |
sco: | Scots | sel: | Selkup | sem: | Semitic (Other) | sga: | Irish, Old (to 1100) |
sgn: | Sign languages | shn: | Shan | sid: | Sidamo | sin: | Sinhalese |
sio: | Siouan (Other) | sit: | Sino-Tibetan (Other) | sla: | Slavic (Other) | slo: | Slovak |
slv: | Slovenian | sma: | Southern Sami | sme: | Northern Sami | smi: | Sami |
smj: | Lule Sami | smn: | Inari Sami | smo: | Samoan | sms: | Skolt Sami |
sna: | Shona | snd: | Sindhi | snk: | Soninke | sog: | Sogdian |
som: | Somali | son: | Songhai | sot: | Sotho | spa: | Spanish |
srd: | Sardinian | srn: | Sranan | srp: | Serbian | srr: | Serer |
ssa: | Nilo-Saharan (Other) | ssw: | Swazi | suk: | Sukuma | sun: | Sundanese |
sus: | Susu | sux: | Sumerian | swa: | Swahili | swe: | Swedish |
syc: | Syriac | syr: | Syriac, Modern | tah: | Tahitian | tai: | Tai (Other) |
tam: | Tamil | tat: | Tatar | tel: | Telugu | tem: | Temne |
ter: | Terena | tet: | Tetum | tgk: | Tajik | tgl: | Tagalog |
tha: | Thai | tib: | Tibetan | tig: | Tigré | tir: | Tigrinya |
tiv: | Tiv | tkl: | Tokelauan | tlh: | Klingon (Artificial language) | tli: | Tlingit |
tmh: | Tamashek | tog: | Tonga (Nyasa) | ton: | Tongan | tpi: | Tok Pisin |
tsi: | Tsimshian | tsn: | Tswana | tso: | Tsonga | tuk: | Turkmen |
tum: | Tumbuka | tup: | Tupi languages | tur: | Turkish | tut: | Altaic (Other) |
tvl: | Tuvaluan | twi: | Twi | tyv: | Tuvinian | udm: | Udmurt |
uga: | Ugaritic | uig: | Uighur | ukr: | Ukrainian | umb: | Umbundu |
und: | Undetermined | urd: | Urdu | uzb: | Uzbek | vai: | Vai |
ven: | Venda | vie: | Vietnamese | vol: | Volapük | vot: | Votic |
wak: | Wakashan languages | wal: | Wolayta | war: | Waray | was: | Washoe |
wel: | Welsh | wen: | Sorbian (Other) | wln: | Walloon | wol: | Wolof |
xal: | Oirat | xho: | Xhosa | yao: | Yao (Africa) | yap: | Yapese |
yid: | Yiddish | yor: | Yoruba | ypk: | Yupik languages | zap: | Zapotec |
zbl: | Blissymbolics | zen: | Zenaga | zha: | Zhuang | znd: | Zande languages |
zul: | Zulu | zun: | Zuni | zxx: | No linguistic content | zza: | Zaza |
Deprecated language codes, which may still appear in metadata records, are:
-ajm: | Aljamía | -cam: | Khmer | -esk: | Eskimo languages | -esp: | Esperanto |
-eth: | Ethiopic | -far: | Faroese | -fri: | Frisian | -gae: | Scottish Gaelix |
-gag: | Galician | -gal: | Oromo | -gua: | Guarani | -int: | Interlingua (International Auxiliary Language Association) |
-iri: | Irish | -kus: | Kusaie | -lan: | Occitan (post 1500) | -lap: | Sami |
-max: | Manx | -mla: | Malagasy | -mol: | Moldavian | -sao: | Samoan |
-scc: | Serbian | -scr: | Croatian | -sho: | Shona | -snh: | Sinhalese |
-sso: | Sotho | -swz: | Swazi | -tag: | Tagalog | -taj: | Tajik |
-tar: | Tatar | -tru: | Truk | -tsw: | Tswana |
The codes used for the Copyright field, accessRights_t, are:
cc-by-3.0: | CC BY 3.0 | cc-by-4.0: | CC BY 4.0 | cc-by-nc-3.0: | CC BY-NC 3.0 |
cc-by-nc-4.0: | CC BY-NC 4.0 | cc-by-nc-nd-3.0: | CC BY-NC-ND 3.0 | cc-by-nc-nd-4.0: | CC BY-NC-ND 4.0 |
cc-by-nc-sa-3.0: | CC BY-NC-SA 3.0 | cc-by-nc-sa-4.0: | CC BY-NC-SA 4.0 | cc-by-nd-3.0: | CC BY-ND 3.0 |
cc-by-nd-4.0: | CC BY-ND 4.0 | cc-by-sa-3.0: | CC BY-SA 3.0 | cc-by-sa-4.0: | CC BY-SA 4.0 |
cc-zero: | CC Zero | ic: | In-copyright | ic-world: | In-copyright (world viewable) |
icus: | US copyright | nobody: | Blocked | op: | Out-of-print |
orph: | Copyright-orphaned | orphcand: | Orphan | pd: | Public domain |
pd-pvt: | Access limited | pdus: | Public domain in US only | supp: | Suppressed from view |
und: | Undetermined | und-world: | Undetermined |
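As a rough illustration of how these codes might be used in an Advanced search, the sketch below builds a Solr clause that matches any of several rights codes. The helper name rights_clause is hypothetical, and the exact clause form accepted by the Workset Builder interface is an assumption; the values are quoted because several codes contain hyphens and dots.

```python
def rights_clause(codes, field="accessRights_t"):
    """Build a Solr clause matching any of the given rights codes.

    The field name comes from the table above; the OR-group syntax is
    standard Solr query syntax, but its handling by the Workset Builder
    is assumed here rather than confirmed.
    """
    if not codes:
        raise ValueError("at least one rights code is required")
    # Quote each code so hyphens and dots are not read as query operators.
    quoted = ['"{}"'.format(code) for code in codes]
    if len(quoted) == 1:
        return "{}:{}".format(field, quoted[0])
    return "{}:({})".format(field, " OR ".join(quoted))


print(rights_clause(["pd"]))
print(rights_clause(["pd", "pdus", "cc-zero"]))
```

The resulting string can be pasted into the Advanced search box, on its own or combined with other clauses.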
Example Solr JSON Records
An example record returned from Solr is a useful way to see which fields are indexed and, from that, to work out the query terms to enter into the Workset Builder interface. At the volume level, here is one of the records returned for the query: title_t:"USITC publication"
{ "lastRightsUpdateDate_t":["20190729"], "lastRightsUpdateDate_s":"20190729", "schemaVersion_s":"https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0", "genre_ss":["http://id.loc.gov/vocabulary/marcgt/doc", "http://id.loc.gov/vocabulary/marcgt/gov"], "typeOfResource_s":"http://id.loc.gov/ontologies/bibframe/Text", "language_ss":["eng"], "accessRights_t":["pd"], "oclc_t":["4263346"], "accessRights_s":"pd", "htid_s":"http://hdl.handle.net/2027/mdp.39015084961401", "pubPlaceId_ss":["http://id.loc.gov/vocabulary/countries/dcu"], "pubDate_t":["197X"], "publisherName_t":["U.S. Govt. Print. Off."], "htid_t":["http://hdl.handle.net/2027/mdp.39015084961401"], "lastRightsUpdateDate_i":20190729, "title_s":"USITC publication /", "title_t":["USITC publication /"], "id":"mdp.39015084961401", "mainEntityOfPageCatalogRecord_s":"https://catalog.hathitrust.org/Record/003922761", "pubPlaceType_ss":["http://id.loc.gov/ontologies/bibframe/Place"], "sourceInstitution_t":["MIU"], "contributorName_t":["United States International Trade Commission."], "sourceInstitution_s":"MIU", "contributorType_ss":["http://id.loc.gov/ontologies/bibframe/Organization"], "publisherId_ss":["http://catalogdata.library.illinois.edu/lod/entities/ProvisionActivityAgent/ht/U.S.%20Govt.%20Print.%20Off."], "pubPlaceName_ss":["District of Columbia"], "pubDate_s":"197X", "publisherName_ss":["U.S. Govt. Print. Off."], "accessProfile_s":"google", "pubPlaceName_t":["District of Columbia"], "dateCreated_i":20200209, "contributorName_ss":["United States International Trade Commission."], "accessProfile_t":["google"], "publisherType_ss":["http://id.loc.gov/ontologies/bibframe/Organization"], "contributorId_ss":["http://www.viaf.org/viaf/126322615"], "language_t":["eng"], "oclc_ss":["4263346"], "dateCreated_t":["20200209"], "dateCreated_s":"20200209", "bibliographicFormat_t":["PublicationVolume"], "bibliographicFormat_s":"PublicationVolume", "_version_":1676455128102600705
}
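One way to get an overview of the searchable fields in a record like the one above is to group its field names by suffix. The following is a minimal sketch that uses an abbreviated copy of that record (only a handful of its fields); the function name fields_by_suffix is hypothetical, and no meaning is assigned to the suffixes beyond what this guide states.

```python
import json
from collections import defaultdict

# Abbreviated copy of the volume-level record shown above (a few fields only).
record_json = """
{
  "title_t": ["USITC publication /"],
  "title_s": "USITC publication /",
  "language_ss": ["eng"],
  "language_t": ["eng"],
  "accessRights_t": ["pd"],
  "accessRights_s": "pd",
  "dateCreated_i": 20200209,
  "id": "mdp.39015084961401",
  "_version_": 1676455128102600705
}
"""


def fields_by_suffix(record):
    """Group field names by the text after their last underscore."""
    groups = defaultdict(list)
    for name in record:
        base, sep, suffix = name.rpartition("_")
        if sep and base and suffix:
            groups[suffix].append(base)
        else:
            # Names such as "id" or "_version_" carry no type suffix.
            groups[""].append(name)
    return {suffix: sorted(names) for suffix, names in groups.items()}


groups = fields_by_suffix(json.loads(record_json))
for suffix, names in sorted(groups.items()):
    print(suffix or "(none)", names)
```

Listing the "_t" group in this way shows which base names (title, language, accessRights and so on) can be tried as volume-level query fields.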
The page-level equivalent of the volume-level search above, which returns every page-level Solr record whose parent volume is titled "USITC publication", is: volumetitle_txt:"USITC publication". Note the prefix 'volume' added to the field name: it separates volume-only metadata searching from page-level searching that is combined with volume metadata. Note also the change of suffix from "_t" (used when searching only volume-level metadata) to "_txt" when searching at the page level. The latter suffix (_txt) is very similar to the former (_t), except that the field is not stored in the Solr index. There is no need to store it with the page-level record, as it can be retrieved when needed from the volume metadata record.
An example record returned by this query is as follows:
{ "volumeid_s":"ien.35556029988656", "id":"ien.35556029988656.page-000038", "volumedateCreated_i":20200209, "volumelastRightsUpdateDate_i":20170115, "volumebibliographicFormat_htrcstring":"PublicationVolume", "volumepubDate_htrcstring":"197X", "volumecontributorId_htrcstrings":["http://www.viaf.org/viaf/158040275"], "volumepublisherName_htrcstrings":["U.S. Govt."], "volumelanguage_htrcstrings":["eng"], "volumetypeOfResource_htrcstring":"http://id.loc.gov/ontologies/bibframe/Text", "volumepubPlaceType_htrcstrings":["http://id.loc.gov/ontologies/bibframe/Place"], "volumecontributorName_htrcstrings":["United States. National Transportation Safety Board."], "volumeoclc_htrcstrings":["6371936"], "volumegenre_htrcstrings":["http://id.loc.gov/vocabulary/marcgt/doc", "http://id.loc.gov/vocabulary/marcgt/gov"], "volumecontributorType_htrcstrings":["http://id.loc.gov/ontologies/bibframe/Jurisdiction"], "volumedateCreated_htrcstring":"20200209", "volumemainEntityOfPageCatalogRecord_htrcstring":"https://catalog.hathitrust.org/Record/002137135431181-4", "volumetitle_htrcstring":"Aircraft accident report /", "volumeschemaVersion_htrcstring":"https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0", "volumepublisherId_htrcstrings":["http://catalogdata.library.illinois.edu/lod/entities/ProvisionActivityAgent/ht/U.S.%20Govt."], "volumelastRightsUpdateDate_htrcstring":"20170115", "volumepubPlaceName_htrcstrings":["District of Columbia"], "volumepubPlaceId_htrcstrings":["http://id.loc.gov/vocabulary/countries/dcu"], "volumesourceInstitution_htrcstring":"NWU", "volumepublisherType_htrcstrings":["http://id.loc.gov/ontologies/bibframe/Organization"], "volumeaccessProfile_htrcstring":"google", "_version_":1676440719503392768, "volumehtid_htrcstring":"http://hdl.handle.net/2027/ien.35556029988656", "volumeaccessRights_htrcstring":"pd"
}
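The naming conventions described above can be sketched in a few lines of code. The first helper applies the 'volume' prefix and the "_t" to "_txt" suffix change to a volume-level field name; this guide confirms the rule only for the title field, so applying it to other fields is an assumption. The second helper splits a page-level id into its volume id and page number, assuming ids always follow the "<volume id>.page-<digits>" pattern seen in the example record. Both function names are hypothetical.

```python
def to_page_level_field(volume_field):
    """Translate a volume-level '_t' field name to its page-level form.

    Example from this guide: title_t -> volumetitle_txt.
    """
    base, sep, suffix = volume_field.rpartition("_")
    if not sep or not base or suffix != "t":
        raise ValueError("expected a field name ending in '_t': %r" % volume_field)
    return "volume" + base + "_txt"


def split_page_id(page_id):
    """Split a page-level id into (volume id, numeric page part).

    Assumes the '<volume id>.page-<digits>' pattern of the example record.
    """
    volume_id, sep, digits = page_id.rpartition(".page-")
    if not sep or not volume_id or not digits.isdigit():
        raise ValueError("unexpected page id format: %r" % page_id)
    return volume_id, int(digits)


field = to_page_level_field("title_t")
print('%s:"USITC publication"' % field)

volume_id, page_number = split_page_id("ien.35556029988656.page-000038")
print(volume_id, page_number)
```

In the example record, the volume id recovered this way matches the value of its volumeid_s field.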