...

The metadata used to facet the search is pulled from the bibliographic metadata for the volumes. Some metadata fields are not universal for volumes in HathiTrust. For example, not every volume has a Library of Congress classification. Faceting on one of these non-universal fields limits the search to volumes that have a value for that field.

HT+BW supports:

  • Single word (unigram) searches
  • String searches - no lemmatization or standardization of the tokens (i.e. words) has been imposed

Use the plus and minus icons to add or remove search fields. Each field will perform a separate search, with each search term appearing as a line on the graph. 

...

Facet

...

Adjust Settings

...

Time

To adjust the time range, drag the start and end points of the "Time" slider left or right as necessary.

Metric

This setting controls how the numerical values are counted. The y-axis label and the chart values change to match the option you choose. The four available options are:

  • % of words shows how frequently the unigram is used relative to all other tokens in the corpus for the given year. This lets you see how often a word is used relative to the size of the corpus, without having to worry about things like whether there are more books in 1850 than in 1900.
  • % of texts gives the number of texts that use your search term at least once as a proportion of the total number of texts published that year. Unlike "% of words," it will not be skewed by a single book that uses a word hundreds of times; however, it may be affected by changes in the typical length of texts over time.
  • Word count plots the actual number of occurrences of the searched word as the y-value.
  • Text count plots the number of volumes in which the searched word occurs. Each such volume counts only once, no matter how many times the word appears in it. (The word "text" is used interchangeably with the word "volume".)
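
As an illustration of how the four metrics relate, here is a minimal Python sketch. It is not HT+Bookworm's implementation: the toy corpus, the `volumes` structure, and the `metrics_for_year` function are all hypothetical and exist only to show the arithmetic.

```python
from collections import Counter

# Hypothetical toy corpus: each volume has a publication year and its tokens.
volumes = [
    {"year": 1850, "tokens": ["the", "whale", "and", "the", "sea"]},
    {"year": 1850, "tokens": ["a", "quiet", "sea"]},
    {"year": 1900, "tokens": ["the", "city", "and", "the", "sea", "and", "the", "sea"]},
    {"year": 1900, "tokens": ["a", "city", "of", "light"]},
]


def metrics_for_year(volumes, word, year):
    """Return the four metric values for one word in one year (illustrative only)."""
    in_year = [v for v in volumes if v["year"] == year]
    # Occurrences of the word in each volume published that year.
    counts = [Counter(v["tokens"])[word] for v in in_year]
    word_count = sum(counts)                      # every occurrence counts
    text_count = sum(1 for c in counts if c > 0)  # each volume counts once
    total_words = sum(len(v["tokens"]) for v in in_year)
    total_texts = len(in_year)
    return {
        "word_count": word_count,
        "text_count": text_count,
        "pct_of_words": 100 * word_count / total_words if total_words else 0.0,
        "pct_of_texts": 100 * text_count / total_texts if total_texts else 0.0,
    }


for year in (1850, 1900):
    print(year, metrics_for_year(volumes, "sea", year))
```

In this toy data the word count for "sea" is 2 in both years, but in 1900 both occurrences come from a single volume, so the text count and % of texts are lower than in 1850.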

Case

  • Insensitive ignores the distinction between lowercase and uppercase characters when counting words.
  • Sensitive maintains the distinction between lowercase and uppercase.
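
The difference between the two settings can be sketched as follows. This is a simplified illustration on a made-up token list; `count_word` is a hypothetical helper, not part of HT+Bookworm.

```python
# Hypothetical token list containing several capitalizations of the same word.
tokens = ["Pop", "pop", "POP", "soda", "Soda"]


def count_word(tokens, word, case_sensitive):
    """Count occurrences of `word`, with or without regard to letter case."""
    if case_sensitive:
        return sum(1 for t in tokens if t == word)
    # Case-insensitive: compare lowercased forms of both sides.
    return sum(1 for t in tokens if t.lower() == word.lower())


print(count_word(tokens, "pop", case_sensitive=True))   # only the exact form "pop"
print(count_word(tokens, "pop", case_sensitive=False))  # "Pop", "pop", and "POP"
```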

Smoothing

Smoothing applies a moving average to the data, which evens out jagged year-to-year fluctuations and makes overall trends easier to see. Smoothing windows are weighted: the year shown is weighted the most heavily, and the weights decrease in each direction until the smoothing span is reached. Smoothing options are described below:

  • To see the raw data points, set smoothing to 0.
  • To average one point on each side of a data point, set smoothing to 1, which sums the previous, current, and next values and divides that sum by 3.
  • A smoothing setting of 5 means that 11 values will be averaged: the data point itself plus 5 values on each side.
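
A centered moving average of this kind can be sketched in Python as below. This is an illustration under stated assumptions, not HT+Bookworm's code: the `smooth` function is hypothetical, the window is simply truncated at the ends of the series, and the optional triangular weights are only one possible reading of "weighted", since the exact weighting scheme is not given here.

```python
def smooth(values, span, weighted=False):
    """Centered moving average using up to `span` points on each side.

    weighted=False averages the window equally (e.g. sum of 3 values / 3).
    weighted=True uses assumed triangular weights that peak at the center.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - span)                 # window is truncated at the edges
        hi = min(len(values), i + span + 1)
        window = values[lo:hi]
        if weighted:
            # Assumption: weight span+1 at the center, dropping by 1 per step away.
            weights = [span + 1 - abs(j - i) for j in range(lo, hi)]
        else:
            weights = [1] * len(window)
        total = sum(w * v for w, v in zip(weights, window))
        smoothed.append(total / sum(weights))
    return smoothed


series = [0, 0, 12, 0, 0]                     # a single spike in the middle
print(smooth(series, 0))                      # smoothing 0: raw values
print(smooth(series, 1))                      # equal weights over 3 points
print(smooth(series, 1, weighted=True))       # center point weighted most
```

With equal weights the spike is spread evenly across its neighbors; with the assumed triangular weights the spike's own year keeps a larger share of the value.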

Access volumes in the HathiTrust Digital Library from HT+Bookworm

Click on any point in the plot to see a list of the volumes for that year, sorted in decreasing order of their contribution to the plot. Each volume title is a hyperlink to the corresponding volume in the HathiTrust Digital Library.

Examples

Soda vs Pop

Soda or pop? How do Americans in different places refer to their soft drinks? The question has been debated in a variety of scientific papers and journal articles, and the Pop vs Soda project plots the regional variations in the use of the terms "pop" and "soda" to describe soft drinks. Current statistics from the project are available at http://popvssoda.com/statistics/USA.html. Their statistics and maps are interesting to read. But what if we want to look back in time? Where can we find historical evidence for our question? What do the results look like if the statistics are based on published works rather than online votes? Try Bookworm! With data extracted from millions of publications from 1940 to 2015, we visualized the "Soda to Pop Ratio" by state. The y-axis lists the states of publication, and the x-axis shows the ratio of "soda" to "pop". For example, the graph shows that publications in Massachusetts use "soda" for soft drinks almost ten times as often as "pop".

Soda to Pop Ratio Graph

We can also visualize the Soda-to-Pop ratio using the statistics from the Pop vs Soda page. Together, the two graphs give a fuller picture and raise more questions. In which states does the word "soda" consistently dominate in publications? Does published language share the same word preferences for soft drinks as spoken language? Start your exploration with Bookworm.

Soda to Pop Ratio Graph (Based on the Pop vs Soda Page statistics)

Peking to Beijing

What is the capital city of China? Interestingly, some would say Peking, while others would say Beijing. Both names refer to the same city: "Beijing" is closer phonetically to the Mandarin pronunciation, while "Peking" has been used internationally for longer. Some argue that the Chinese government insists on the more modern transliteration "Beijing" rather than "Peking". They also claim that, with China's rapid development and increasing power, the trend of replacing "Peking" with "Beijing" is growing. To investigate this argument, we used Bookworm to examine word usage in publications from six countries from the 1960s to the 2010s. We then generated a graph showing the log ratio of "Peking" to "Beijing" grouped by country. The y-axis marks the country of publication, and the x-axis shows the time of publication. Blocks of different colors indicate different ranges of the ratio. Click on a block to see a list of related publications from that period in the country you picked. Try different settings and search terms to find more!

Log Ratio of Peking to Beijing (by country)

Part of the Lists of Related Publications
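
The log ratio behind a graph like this can be sketched in Python as follows. The counts, the country and decade keys, and the `log_ratio` helper are all hypothetical, and base 10 is an assumption, since the logarithm base used for the graph is not stated here.

```python
import math

# Hypothetical word counts per (country, decade); not real HathiTrust data.
counts = {
    ("United States", 1970): {"Peking": 900, "Beijing": 100},
    ("United States", 2000): {"Peking": 100, "Beijing": 800},
}


def log_ratio(count_a, count_b):
    """Base-10 log of count_a / count_b, or None if either count is not positive."""
    if count_a <= 0 or count_b <= 0:
        return None  # the ratio is undefined or infinite in this case
    return math.log10(count_a / count_b)


for (country, decade), c in sorted(counts.items()):
    value = log_ratio(c["Peking"], c["Beijing"])
    print(country, decade, round(value, 2))
```

A positive value means "Peking" is more frequent, a negative value means "Beijing" is more frequent, and 0 means the two are used equally often. Taking the logarithm makes a tenfold preference in either direction the same distance from 0.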

Inuit or Eskimo

Another common word-preference debate is "Inuit or Eskimo". The term "Eskimo" is commonly used in Alaska to refer to all Inuit people. However, this name, given by non-Inuit people, is considered offensive in many other places. Linda Lanz, who holds a Ph.D. in linguistics from Rice University in Houston, claims that “In Canada, the term Inuit is preferred over Eskimo, which is considered offensive.” She sums it up with “Canada: Inuit; United States (i.e., Alaska): Eskimo.”

...

Popularity of Years: 1950, 1951, 1952, 1953, 1954, 1955 (smoothing: 0 years)

 


This graph shows more clearly how the popularity of a single year changes over time. The popularity of each year A rises sharply in the 1-3 years after A itself and then declines, quickly at first and then more gradually, until it settles at a low level. Is this pattern universal? Does it reflect how people remember and forget history?

...

Word Popularity of 4 countries after 1945



References

Michel, Jean-Baptiste, Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., ... & Aiden, E. L. "Quantitative analysis of culture using millions of digitized books." Science 331.6014 (2011): pp. 176-182.

...