In addition, each of the meta files has 37 columns. They are as follows:
The unique volume identifier used by Underwood et al. matches the HathiTrust Volume ID (HTID), except that identifiers have been made filename-safe by substituting '+' for ':' and '=' for '/'.
Volume HTID. Substantially redundant with 'docid' except as noted above.
A short, cleaned version of the volume author, copied from Underwood et al.
A short, cleaned version of the volume title, copied from Underwood et al.
A best guess of the volume's publication date, copied from Underwood et al.
MARC entries indicating the geographic subject matter of the text, copied from Underwood et al. Coverage is uneven (about 10% of volumes).
A normalized version of the unique string identified as a location in a text. Original strings have been lowercased and unaccented, with punctuation replaced by spaces and whitespace regularized. For example, 'London' becomes 'london'; 'Île-de-France' becomes 'ile de france'; and ' New York ' becomes 'new york'.
Integer count of the number of times a given normalized string occurs in the volume.
The two-letter country code representing the publication location of the volume, where known. This can be useful as a proxy for the national origin of the volume.
A standard, long-form representation of the geographic location to which the text_string corresponds. Example: 'New York, NY, USA'.
The type of location. Examples include 'country', 'administrative_area_level_1' (like a US state, Canadian province, or 'England' in the UK), 'locality', 'church', 'bar', and 'route'. There are 118 distinct values.
Long-form name of the country within which the location is found.
Short-form (two letter) version of the country name where the location is found.
Long-form version of the top-level administrative area to which the location belongs. Corresponds to US states, French regions, etc.
Short-form version of the admin_1 label. Two letters for US states, but can be of arbitrary length elsewhere.
Second-level administrative area. For example, US counties.
'admin_3' | 'admin_4' | 'admin_5'
Third-, fourth-, and fifth-level administrative areas.
City, town, village, etc.
First-level administrative subunit of a locality. The boroughs of New York City correspond to this level.
'sublocality_2' | 'sublocality_3'
Second- and third-level administrative subunits of a locality.
Possibly colloquial; does not necessarily correspond to an administrative district.
A named building or similar.
Part of a premise.
The number portion of a street address. May contain non-numeric data.
Full street address of buildings and similarly addressable point locations. Lightly used. Not to be confused with 'formatted_address', which is always present.
Name of the road for road-like locations.
Alphanumeric postal code of the location, if applicable.
The name of lakes, rivers, mountain ranges, continents, etc. Can be especially useful for locations that span administrative boundaries, e.g., 'Atlantic Ocean'.
A label that Google applies inconsistently to locations of some sort of interest. The PI deemed it unreliable for general use, but it was left in because it might be useful in rare cases.
Unofficial names for locations and regions, including those that do not correspond to any existing administrative unit. Examples: 'Central Africa', 'Caribbean'.
The name of a continent, in cases of direct reference to the continent itself. Unfortunately, while most other fields are supplied hierarchically (that is, locations at lower administrative levels include data for the higher-level administrative areas to which they belong), this field is not. If one needs to aggregate locations by continent, one must supply an independent mapping between countries (or other location types) and continents.
A lightly used catch-all for unclassifiable location types.
'lat' | 'lon'
The latitude and longitude of the location. For non-point locations, this is a single point near the geographic center of the area.
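
Three of the descriptions above imply small transformations: recovering an HTID from a filename-safe identifier, normalizing a location string, and aggregating counts by continent through an independent mapping. The following is a minimal sketch of each, not the code used to build the data. The function names, the sample identifier, the row keys `country_short` and `count`, and the deliberately tiny country-to-continent table are all assumptions made for illustration. The HTID conversion also assumes that the original identifier contains no literal '+' or '='.

```python
import re
import unicodedata
from collections import Counter


def docid_to_htid(docid: str) -> str:
    """Undo the filename-safe substitutions: '+' back to ':' and '=' back to '/'.

    Assumes the original HTID contained no literal '+' or '='.
    """
    return docid.replace("+", ":").replace("=", "/")


def normalize_location(text: str) -> str:
    """Lowercase, strip accents, replace punctuation with spaces, regularize whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or whitespace with a space.
    spaced = re.sub(r"[^\w\s]|_", " ", unaccented.lower())
    # Collapse runs of whitespace and trim both ends.
    return " ".join(spaced.split())


# Hypothetical and incomplete mapping; a real analysis needs a full table.
COUNTRY_TO_CONTINENT = {"US": "North America", "GB": "Europe", "FR": "Europe"}


def counts_by_continent(rows, mapping=COUNTRY_TO_CONTINENT):
    """Sum occurrence counts per continent using an external country mapping."""
    totals = Counter()
    for row in rows:
        # Rows whose country code is missing from the mapping go to 'Unknown'.
        continent = mapping.get(row["country_short"], "Unknown")
        totals[continent] += int(row["count"])
    return totals


# Illustrative rows; the keys are assumed names, not confirmed column headers.
sample_rows = [
    {"country_short": "US", "count": 3},
    {"country_short": "GB", "count": 5},
    {"country_short": "FR", "count": 2},
    {"country_short": "", "count": 1},
]
continent_totals = counts_by_continent(sample_rows)
```

Keeping the continent mapping as a separate table means it can be extended or swapped without touching the location data, which matters because the continent field itself is populated only for direct references to a continent.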