ACS research team discusses their midpoint findings for the Semantic Phasor Embeddings project.

ACS awardees: Molly Des Jardin, Scott Enderle, Katie Rawson (University of Pennsylvania)

Much recent discussion of quantitative research in the humanities has concerned scale. Confronted with the vast quantities of data produced by digitization projects over the last decade, humanists have begun exploring ways to synthesize that data to tell stories that could not have been told before. Our ACS project aims to make that kind of work easier by creating compact, non-expressive, non-consumptive representations of individual volumes as vectors. These vectors will contain information not only about the topics the volumes cover, but also about the way they order that coverage from beginning to end. Our hope is that these representations will allow distant readers to investigate the internal structures of texts at larger scales than have been possible before. But now that we've reached the midpoint of our work, our preliminary results have led to some surprising reflections about scale at much smaller levels.

...

Although other projects have used Fourier transforms to smooth out noise, our aim is different. (Indeed, if we could preserve all frequency bands without breaking the HathiTrust terms of service, we would!) Instead, we use Fourier transforms to create orthogonal representations of fluctuations at different scales, called phasors, which can be added and subtracted in structure-preserving ways. The mathematical properties of phasors make them well suited for the same kinds of algebraic manipulations that allow word vectors to represent analogies.
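
To make that concrete, here is a minimal sketch of the general technique rather than our actual pipeline: it applies NumPy's real-valued FFT to a made-up series of per-segment topic weights for a single volume. The 64 segments, the random weights, and the cutoff of eight frequency bands are all illustrative assumptions, not project parameters.

    import numpy as np

    # Hypothetical input: one topic's weight in each of 64 equal-length
    # segments of a volume (random numbers here, not project data).
    rng = np.random.default_rng(0)
    topic_weights = rng.random(64)

    # The real-valued FFT decomposes the series into orthogonal frequency
    # components (phasors): each complex coefficient records the amplitude
    # and phase of fluctuation at one scale, from whole-book trends (low
    # frequencies) to segment-to-segment noise (high frequencies).
    phasors = np.fft.rfft(topic_weights)

    # Keeping only the lowest-frequency bands gives a compact summary of
    # how the topic's coverage is organized from beginning to end; the
    # cutoff of eight bands is an arbitrary choice for this sketch.
    compact = phasors[:8]

    # Because the transform is linear, adding or subtracting two volumes'
    # phasor vectors corresponds to adding or subtracting their underlying
    # topic curves, which is what makes structure-preserving arithmetic
    # possible.
    print(compact.shape)  # (8,)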

[Image: Example vector analogies]

Left: word vectors for London, England, Moscow, and Russia. Right: the vector operations representing an analogy.

Just as word vectors allow us to express the idea that Moscow is to Russia as London is to England using a mathematical equation (Moscow – Russia + England = London), phasors might allow us to represent structural analogies between texts, identifying documents that discuss different topics using the same underlying organization.
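
The sketch below walks through that arithmetic with hand-picked toy vectors; the numbers are invented purely so the analogy comes out cleanly and have nothing to do with any real embedding model.

    import numpy as np

    # Hand-picked 2-D "word vectors," invented so the analogy works out;
    # real embeddings are learned and have hundreds of dimensions.
    vectors = {
        "Moscow":  np.array([5.0, 9.0]),
        "Russia":  np.array([5.0, 1.0]),
        "London":  np.array([2.0, 9.0]),
        "England": np.array([2.0, 1.0]),
    }

    # Moscow - Russia isolates a "capital-of" offset; adding England
    # should land near London if the analogy holds.
    target = vectors["Moscow"] - vectors["Russia"] + vectors["England"]

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # The nearest word vector to the computed point completes the analogy.
    print(max(vectors, key=lambda w: cosine(vectors[w], target)))  # London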

...

How can we do better? It might seem at first that we could simply add more information, for example in the form of larger n-gram windows. But in fact, after a certain point, adding more information makes things worse, at least if it comes in the form of additional independent dimensions. In very high-dimensional space, the distances between points become more and more narrowly distributed, so that most points are about the same distance from one another. Even very complex datasets start looking like smooth, round balls. This makes it increasingly hard to distinguish points that are close to each other for interesting reasons from points that are close by pure coincidence. (For the mathematically inclined, this phenomenon is called concentration of measure.)
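
A quick simulation makes the effect visible: for random points in a unit hypercube, the spread of pairwise distances shrinks relative to their average as the number of dimensions grows. The point count and dimensions below are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Draw 100 random points in a unit hypercube and measure how spread
    # out their pairwise distances are, relative to the average distance.
    for dim in (2, 10, 100, 1000):
        points = rng.random((100, dim))
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        dists = dists[np.triu_indices(100, k=1)]
        # The ratio shrinks as the dimension grows: in high-dimensional
        # space, most points sit at nearly the same distance from one
        # another.
        print(dim, round(float(dists.std() / dists.mean()), 3))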

Given these challenges, paying attention to word order seems like a promising strategy. And our preliminary results provide some confirmation of that hunch.

...