Forcing Standardization in VARD, Part 2

The final aspect of standardization I will discuss is the forcing of common early modern spellings to their modern equivalents: decisions where the payoff of consistency outweighs a slight loss of data.

The VEP team decided to force bee > be, doe > do, and wee > we.

Naturally one can see the problems inherent to these forced standardizations.

In early modern spelling, bee can stand for the insect as well as the verb. Similarly, doe can signify a deer or the verb, and wee can be either an adjective or the pronoun. We hypothesized that in our drama corpus 1) bee would overwhelmingly be the verb; 2) doe would overwhelmingly be the verb; and 3) wee would overwhelmingly be the pronoun.

The decision to force these words was supported by sampling the frequency of meanings in the early modern drama corpus, along with frequencies from Anupam Basu’s EEBO-TCP Key Words in Context tool, set to original spelling and offered by Early Modern Print: Text Mining Early Printed English.

My method to determine meaning frequency is as follows:

  1. I searched for the first 1,000 instances of a spelling in the early modern drama corpus and in Key Words in Context.
  2. I generated CSVs of the 1,000 hits of the spelling in question, including surrounding text to gain context and determine the word’s signification.
  3. When I located a word that deviated from the meaning VEP had hypothesized for the spelling, I highlighted the entry and took notes in a column beside the line.
  4. After I read through 1,000 instances of the spelling, I tallied the number of times the word did not match our hypothesized meaning.
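
The tallying in step 4 can be sketched in a few lines of Python. This is only an illustration, not the procedure I actually ran: the column names (left, hit, right, note) and the three sample rows are invented for the example, and it assumes that any row with a non-empty note is one that deviates from the hypothesized meaning.

```python
import csv
import io


def tally_deviations(csv_text, note_column="note"):
    """Count rows whose note column is non-empty, i.e. rows flagged as deviating.

    The column name is a hypothetical choice for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    deviations = 0
    for row in reader:
        total += 1
        if (row.get(note_column) or "").strip():
            deviations += 1
    rate = deviations / total if total else 0.0
    return deviations, total, rate


# Invented toy rows standing in for a CSV of hits with surrounding context.
sample = """left,hit,right,note
it shall,bee,done my lord,
the busy,bee,gathers honey,insect
I will,bee,there anon,
"""

deviations, total, rate = tally_deviations(sample)
print(f"{deviations} of {total} hits deviate ({rate:.1%})")
```

With a real CSV of 1,000 hits, the same function would return the count, the sample size, and the rate for each spelling and corpus.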


Bee as insect rather than verb:

Corpus       Not the verb   Hits reviewed   Rate
EM Drama     17             1,000           1.7%
Key Words    71             1,000           7.1%

Bee as the insect is more frequent in the first 1,000 hits of Key Words for reasons of genre. Key Words contains all of EEBO-TCP, which includes early dictionaries (Thomas Elyot) and husbandry texts (John Fitzherbert). Moreover, compilers like George Gascoigne recognized the metaphorical power of the bee’s work (travelling from flower to flower to make sweet honey) and used it as meta-commentary on their own labor of gathering the most delightful and edifying writing.


Doe as deer rather than verb:

Corpus       Not the verb   Hits reviewed   Rate
EM Drama     0              1,000           0%
Key Words    1              1,000           0.1%

I also looked into does and its variant spellings in the drama corpus, to see how common the animal would be as opposed to the verb. Searching for does yielded one instance of the animal in the first 1,000 instances of the spelling (0.1%). Searching for doe’s yielded 158 instances of the spelling, all of which were the verb.

The above results suggest minimal data loss for standardizing all instances of doe to do in the drama corpus.

It is harder to pin down figures for the decision to force wee to we.

Searching for wee in the early modern drama corpus, I identified 4 of the first 1,000 instances that were not the pronoun. One looked like it should have been well; another looked like an elision of God be with yee (God b’wee). The remaining two instances were French and, if standardized, would be oui.

The first 1,000 instances of wee in Key Words in Context contained too much noise to yield a reliable figure. It seems that the text searched by Key Words in Context doesn’t preserve the TCP notation for illegible characters, the bullet (•). In many places I had to look at the original TCP files to determine the signification of wee, because neither the pronoun we nor the adjective wee made sense. When I consulted the files, I found that these instances of wee corresponded to words with illegible characters (e.g., we•e).
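
One way to separate this noise from genuine hits would be to check the TCP text for tokens that only look like wee once the bullet is removed. The Python sketch below is hypothetical: it assumes that Key Words in Context simply drops the bullet (which is my impression, not something I have confirmed), and the token list is invented for illustration.

```python
BULLET = "\u2022"  # TCP notation for an illegible character


def collapses_to(token, target):
    """True if the token contains a bullet and equals target once bullets are dropped."""
    return BULLET in token and token.replace(BULLET, "") == target


# Invented tokens standing in for words read from a TCP file.
tokens = ["wee", "we\u2022e", "w\u2022ee", "well", "be\u2022"]

# Tokens that would surface as "wee" in a search that ignores the bullet.
flagged = [t for t in tokens if collapses_to(t.lower(), "wee")]
print(flagged)
```

Any flagged token is a damaged word rather than a true instance of the spelling, so it could be set aside before tallying meanings.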

What do these standardizations mean for the drama corpus?
If you work on bee or deer imagery in early modern drama, you will want to look somewhere else. For the bee example, if the rate of 17 in 1,000 instances of bee as insect holds steady over the 6,694 instances of bee in the drama corpus, then ~114 of those 6,694 spellings refer to the insect. Overall, with an error rate of 1.7%, data loss in the corpus is minimal when the spelling bee is forced to be.
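
The projection above is simple proportional scaling, and it only holds if the first 1,000 hits are representative of the rest of the corpus. A short Python sketch of the arithmetic, using the figures reported in this post (the function name is just for illustration):

```python
def extrapolate(deviations, sample_size, corpus_total):
    """Scale a sampled deviation rate up to the whole corpus.

    Assumes the sampled rate holds steady across all instances.
    """
    rate = deviations / sample_size
    return rate, rate * corpus_total


# 17 insect readings in the first 1,000 hits; 6,694 instances of bee overall.
rate, expected = extrapolate(17, 1000, 6694)
print(f"{rate:.1%} of 6,694 is about {expected:.1f}")
```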

Granted, I looked only at the first 1,000 instances of each spelling in the corpus and in Key Words, so I reviewed different, non-random portions of these corpora. The VEP team decided the sampling was nonetheless telling for the context of the drama corpus. Another inconsistency is the order in which the files were searched in Key Words and in the drama corpus. Key Words doesn’t give the user options for ordering the results, so the hits are displayed in chronological order. For the drama corpus, files were searched from smallest to largest TCP file number. Overall, the frequency of significations suggests small margins of error for the standardizations of bee, doe, and wee within the corpus.

