written by: Michael Gleicher, Jonathan Hope, Michael Witmore, and Deidre Stuffer
A key tenet of the Visualizing English Print project is that by considering larger collections of documents, we are able to observe larger amounts of variation, which can allow us to observe phenomena of interest to scholars.
Unfortunately, variation also brings challenges. Many of them stem from the fact that as corpora grow in size and complexity, more kinds of variation emerge (as well as more of each kind). Some of these we didn’t anticipate, and some have turned out to be more important (and difficult) to address than we expected.
Therefore, in the VEP project we have had to make decisions that help us mitigate “undesirable” variation, such that we can better focus on the kinds of variation of interest.
Consider the following five categories of variation:
- Historical variation of the kinds we are interested in.
- Historical variation that we didn’t know we were interested in (or may not be interested in).
- Variation introduced by the artifacts (physical documents) and their history.
- Variation introduced by the digitization process of these artifacts.
- Variation introduced by the processing of the digital artifacts.
These categories can be viewed chronologically: 1&2 occur in the time period; 3 is what happens from the time the artifact is created until when it is digitized; 4 is what happens during digitization; and 5 is what happens with the digitized files.
For us, 1&2 is everything that went into the book: the words the author chose and the way that the printer chose to put those words onto the page. If we’re interested in the history of the period when the book was written, these are our primary interests. The distinction between 1 and 2 (whether it’s a kind of variation we know we are interested in, or a kind that we either aren’t interested in (yet) or didn’t know about) will come back later.
#3 is what happens to the book after it is printed. This may be interesting in its own right. For example, if a book is considered important or valuable, it is more likely to be preserved well (thanks to Laura Mandell for making us aware of these kinds of concerns). However, this is a statement about the time after the book was written – not about what was in the book itself (except, maybe, indirectly, since the contents of the book influence what later people thought about it and did with it). The contemporary perception of the importance of the writer or subject matter may also affect the quality of paper used for a book, the amount of attention paid during the printing process, and indeed the choice of printer.
#4 is what happens from the time the book is pulled off the shelf to be imaged until we receive a data file. This includes imaging (e.g., microfilming), transcription (in the case of TCP), data manipulation, editing and curation, etc. Creating digital versions of a large text corpus is a massive undertaking, involving many people working over many years. Some variation in how texts are handled is probably unavoidable. Because digitization will often require some degree of judgment and decision making, the prior kinds of variation might be attenuated.
#5 is what happens from the time we receive the file through when we do our analysis.
Let us offer a fictional (but realistic) example. It is naïve to think that the English language can be represented with 26 letters (pun intended), but suppose we see that in some document, the word naïve is spelled wrong (i.e., with a regular i, not one with an umlaut over it). Where did this variation in spelling come from?
- Did the author intend to spell it this way? (telling us something about how they thought about words)
- Did the printer change it in typesetting? (either because they thought it was spelled with an i, or maybe they ran out of i-umlauts?)
- Did the page get smudged so that the second dot of the umlaut can’t be seen?
- Did the transcriber (or OCR engine) make an incorrect decision? Or, on the particular day that page was transcribed, was a different encoding of the i with an umlaut over it used than on most other days?
- Did the file reading library I use incorrectly interpret the Unicode character?
Ideally, we would be able to understand our variation.
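The last two questions in the list above turn on how the character is encoded, so a small illustration may help. The sketch below is not taken from the VEP tools; it only uses Python's standard `unicodedata` module to show that an i with two dots over it can be stored either as one code point or as a plain i followed by a combining mark, and that the two look identical on screen while comparing as different strings. The helper names `same_after_nfc` and `strip_marks` are our own, chosen for illustration.

```python
import unicodedata

# Two encodings of the same visible word.
precomposed = "na\u00efve"   # U+00EF: a single code point for i with two dots
decomposed = "nai\u0308ve"   # plain i followed by U+0308 COMBINING DIAERESIS


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def strip_marks(text: str) -> str:
    """Decompose the text, then drop every combining mark (a lossy step)."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(precomposed == decomposed)                # raw comparison
    print(same_after_nfc(precomposed, decomposed))  # comparison after NFC
    print(strip_marks(precomposed))                 # marks removed
```

Comparing after normalization removes variation of type 5 without touching the text's content. Stripping the marks, by contrast, also erases any historical variation they carried, which is exactly the kind of decision the rest of this post is about.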
Interesting and (not-yet) interesting variation
Variation types 1 (interesting historical) and 2 (not-yet interesting historical – we used to say “uninteresting” but often these things become interesting once you start to find out about them) are both important.
If we were interested in how a particular writer switched between spellings of the word naïve in order to indicate something, or more generally in the decline of the use of umlauts in the 17th century, then the variation in whether the letter i has one or two dots over it when surrounded by the letters n, a, v and e would be interesting (type 1 variation). If this is the case, we would probably want to be really careful to make sure that (to the extent possible) if we saw the variation, we could attribute it historically (to either the writer or publisher) – not to the transcriber, Unicode library author, or even the librarian.
It is tempting to think that “we should just be careful about everything.” Since we don’t know whether or not the umlauts are interesting, we should preserve them. For something as simple and specific as umlauts, it might be OK. But this is a bad approach in general for a number of reasons:
- It can be costly. Each thing you try to preserve probably has a number of costs (e.g., to make sure we don’t make mistakes with umlauts, we must make sure that all software tools we use properly handle them, that the transcriber consistently identifies and codes them …).
- You can’t predict everything ahead of time, and if you try you end up on a slippery slope. First, you want i-umlauts differentiated from i. Then you find you’re interested in the size and shape of the umlauts. Then you want the micro-patterns within the umlauts (maybe for a theory that because the printers resented using them they didn’t clean them as well, so the ink wasn’t applied as uniformly). And so on.
- It matters statistically. Post-hoc arguments are different than a priori ones. Saying that you found a one-in-a-million case isn’t so surprising when you look at a billion things (you would expect to find about a thousand of them). Predicting which one ahead of time is different.
- It also matters for searching. Suppose you are interested in instances of the feminine pronoun ‘she’, but your encoding has retained all the possible spellings: “she”, “shee”, “She”, “Shee”, “fhe”, etc. To find them all, you need to remember all the possible variants.
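To make the searching point concrete, here is a minimal sketch of what the searcher has to do when variants are retained. The pattern is our own guess at a reasonable net for the spellings listed above (it treats an initial s, long s, or f as interchangeable and allows a doubled e); it is illustrative only and is not a rule used by the project.

```python
import re

# Hypothetical pattern covering the variants named in the text:
# initial s, long s (ſ), or f; then "he"; then an optional second e.
SHE_VARIANTS = re.compile(r"\b[sſf]hee?\b", re.IGNORECASE)


def find_she(text: str) -> list[str]:
    """Return every token in the text that matches one of the assumed variants."""
    return SHE_VARIANTS.findall(text)


if __name__ == "__main__":
    sample = "Shee said that fhe and she and SHEE saw the sheep"
    print(find_she(sample))
```

Even this small pattern encodes judgment calls: any variant it does not anticipate is silently missed, and a broader pattern would start to catch unrelated words.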
How to deal with unwanted variation?
Interesting variation is good: it’s where we see the differences between things.
Variation can also be bad: the differences between things might obscure other similarities or patterns. For example, the fact that naïve might be spelled wrong because of software issues might cause us to miss that two different authors are saying the same thing – albeit with different spellings.
Getting rid of variation is hard – especially when it isn’t well understood. While it might be easy to see that naïve and its misspellings have the same meaning, other cases are more of a judgment call.
Standardization or normalization is the process of removing variance. It requires recognizing that two (or more) things should be the same, and replacing them with a consistent thing. Both steps are potentially problematic: it can be difficult to determine if things really are equivalent, and it can be difficult to choose an appropriate consistent representation for both. For an example of what this process may look like, see the following blog entry on forcing Early Modern English spellings to certain modern English equivalents.
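The two steps can be sketched as a lookup: a table records which forms we have decided are equivalent, and a function replaces each with the chosen standard form. The table entries and the names `STANDARD_FORMS` and `standardize` are hypothetical and exist only for illustration; a real table would be far larger, and this is not the procedure described in the blog entry mentioned above.

```python
# Hypothetical equivalence table: variant spelling -> chosen standard form.
# Deciding what belongs in this table is the "recognizing" step.
STANDARD_FORMS = {
    "shee": "she",
    "fhe": "she",
    "naive": "naïve",
}


def standardize(tokens: list[str]) -> list[str]:
    """Replace each token with its standard form, if the table has one.

    Lowercasing is itself a normalization decision: it discards
    capitalization, which may or may not be variation of interest.
    """
    result = []
    for token in tokens:
        key = token.lower()
        result.append(STANDARD_FORMS.get(key, key))
    return result


if __name__ == "__main__":
    print(standardize(["Shee", "is", "naive"]))
```

Tokens that are missing from the table pass through unchanged (apart from lowercasing), so any variation we did not recognize in advance remains in the data.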
Specific Kinds of Variation
As we consider more documents, the range of these documents grows larger. We can consider a broader range of authors, writing about a broader range of subjects, in a broader range of document types, using a broader range of styles. We might expect connections between these kinds of variability: styles and subjects go in and out of favor over time, different authors use different words, etc.