Tweaking VARD: Aggressive Rules for Early Modern English Morphemes and Elisions

Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus.

Given the size of Hope’s corpus, comparing VARD’s output to the original play files had to be automated. Erin Winter wrote a case-sensitive Python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking only at the highest frequencies: unique spellings changed between roughly 46,000 and 100 times, which amounted to nearly 3,000 cases. (There were approximately 58,000 unique spellings in the corpus changed 10 or fewer times.) To offer a glimpse, here are the 10 most frequent VARD normalizations for the early modern drama corpus:

ORIGINAL    NORMALIZED    FREQUENCY
haue        have          45680
selfe       self          18473
Ile         Isle          16095
loue        love          15666
thinke      think         10450
mee         me            10437
vpon        upon          10287
owne        own           10205
vp          up            9704
’tis        it is         9691
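
The tallying step described above can be sketched in a few lines of Python. This is not Erin Winter’s script, only a minimal illustration of the idea. It assumes that VARD marks each change with a tag of the form <normalised orig="haue" auto="true">have</normalised>; if your output looks different, adjust the pattern. The names TAG, tally_changes, and write_csv are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Assumed shape of a VARD-inserted tag; change the pattern if your files differ.
TAG = re.compile(r'<normalised\s+orig="([^"]*)"[^>]*>([^<]*)</normalised>')


def tally_changes(texts):
    """Count each (original, normalized) pair, keeping case distinctions."""
    counts = Counter()
    for text in texts:
        counts.update(TAG.findall(text))
    return counts


def write_csv(counts, handle):
    """Write the pairs to a CSV, most frequent change first."""
    writer = csv.writer(handle)
    writer.writerow(["ORIGINAL", "NORMALIZED", "FREQUENCY"])
    for (orig, norm), freq in counts.most_common():
        writer.writerow([orig, norm, freq])


# A made-up sample line standing in for a tagged play file.
sample = (
    'I <normalised orig="haue" auto="true">have</normalised> it, and you '
    '<normalised orig="haue" auto="true">have</normalised> it; '
    '<normalised orig="Ile" auto="true">Isle</normalised> go.'
)
counts = tally_changes([sample])
buffer = io.StringIO()
write_csv(counts, buffer)
print(buffer.getvalue())
```

Because the pattern is matched as written, Ile and ile would be counted as separate spellings, which is what a case-sensitive comparison needs.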

The CSV tracking normalizations proved a painless way to identify where VARD needed a gentle push in another direction. Note Ile in the above table. Yes, England is an island (of which writers were aware), but 16,095 changes to Isle seemed suspect. When I looked at files with VARD-inserted XML tags, it became obvious those Iles should have been standardized to I’lls. There, VARD was simply wrong. (I will devote the next post to where VARD goofs–sometimes amusingly–in standardization.)

By researching questionable corrections, I was able to formulate standardization rules more “aggressive” than those the program ships with. (You can locate the default rules in the file “rules.txt,” in VARD’s “training” folder.) These rules dictate modern letter substitutions for common early modern letter combinations. Examples of the rules are as follows:

CHARACTERS    CHANGE TO    LOCATION IN WORD
vv            w            Anywhere
ie            y            Anywhere

Given the above rules, when VARD processes the word alvvaies, the program may suggest multiple variants: alwaies and alvvays. This produces competing spellings of the same word across the standardized documents, a problem that proliferates when VARD handles early modern prepositions and adverbs, and even words with apostrophes (e.g., ne’er, ne’re, and nev’r normalize differently; should the apostrophe be eliminated or maintained?).
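
To see why one word can yield competing variants, here is a rough sketch in Python. It is not VARD’s actual algorithm, which weighs several methods against its dictionary; it only applies each of the two example rules on its own and then both together. The names RULES, single_rule_variants, and apply_all_rules are hypothetical.

```python
# The two example rules from the table above, both valid anywhere in a word.
RULES = [("vv", "w"), ("ie", "y")]


def single_rule_variants(word, rules):
    """Return the variants produced by applying each rule on its own."""
    variants = []
    for old, new in rules:
        if old in word:
            variants.append(word.replace(old, new))
    return variants


def apply_all_rules(word, rules):
    """Apply every rule in turn to reach the fully modernized form."""
    for old, new in rules:
        word = word.replace(old, new)
    return word


print(single_rule_variants("alvvaies", RULES))  # ['alwaies', 'alvvays']
print(apply_all_rules("alvvaies", RULES))       # always
```

Neither single-rule variant is the modern spelling, which is why a suggestion built from only one rule can leave competing forms in the output.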

My additions to “rules.txt” aided not only spelling standardization but also the expansion of elisions. The rules mainly gave VARD an extra push to handle early modern English morphemes. While “rules.txt” contains a rule that ie at the end of words can be changed to y, it had no rule to help with standardizing the common adverb ending lie. Here is a table of the rules I added:

CHARACTERS    CHANGE TO    LOCATION IN WORD
cyon          tion         End
lie           ly           End
shyp          ship         End
t’            *to_         Start
th’           *the_        Start
tiue          tive         End
vn            un           Start
vs            us           Anywhere
ynge          ing          End

While not comprehensive, the rules definitely aided VARD’s efforts. Of course, entering rules is only one step of the process. After adding rules, you must still manually train VARD to apply them.

* A final word regarding the entries I made to expand the elisions t’ and th’ when they begin words. In the table above, I typed an underscore (_) to show that there is a space after to and the in the rules. VARD recognizes spaces in rule input. In the GUI the rule is displayed with an underscore, but you do not type the underscores in. The rules worked, and the program properly expanded words after some manual training. It changed th’ambassador to the ambassador, and t’change to to change.
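
For readers who want to see how position-bound rules like these behave, here is a small Python sketch. It only imitates the effect of the rules; it does not read “rules.txt” or reproduce how VARD applies and ranks rules internally. The footnote asterisk is dropped from the replacements, the underscore is treated as a space, and the names ADDED_RULES, apply_rule, and candidates are hypothetical.

```python
# The added rules as (characters, change to, location in word).
# The underscore stands for a space, as in the table.
ADDED_RULES = [
    ("cyon", "tion", "End"),
    ("lie", "ly", "End"),
    ("shyp", "ship", "End"),
    ("t’", "to_", "Start"),
    ("th’", "the_", "Start"),
    ("tiue", "tive", "End"),
    ("vn", "un", "Start"),
    ("vs", "us", "Anywhere"),
    ("ynge", "ing", "End"),
]


def apply_rule(word, old, new, location):
    """Apply one rule if it matches at the required location, else return None."""
    new = new.replace("_", " ")  # underscore marks a space in the replacement
    if location == "Start" and word.startswith(old):
        return new + word[len(old):]
    if location == "End" and word.endswith(old):
        return word[: -len(old)] + new
    if location == "Anywhere" and old in word:
        return word.replace(old, new)
    return None


def candidates(word):
    """Collect the result of every rule that matches the word."""
    results = []
    for old, new, location in ADDED_RULES:
        changed = apply_rule(word, old, new, location)
        if changed is not None:
            results.append(changed)
    return results


print(candidates("th’ambassador"))  # ['the ambassador']
print(candidates("t’change"))       # ['to change']
print(candidates("boldlie"))        # ['boldly']
```

Note that the t’ rule does not fire on th’ambassador, because the word begins with th rather than with t followed by an apostrophe.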
