VARD Normalization Errors

VARD does a decent job of standardizing Early Modern English spelling. Sometimes, though, it makes questionable replacements.

ORIGINAL       NORMALIZATION   SHOULD BE
all’s          ell’s           all’s
caus’d         cause           caused
Cicilia        Cicely          Cicilia
courtesie      curtsy          courtesy
diuers         divers          diverse
hir            his             her
ile            isle            I’ll
ist            first           is’t
kild, killd    kilt            killed
maister        moister         master
maist          moist           mayst
*nunc          nuns            nunc
Pauls          Pals            Paul’s
*qui           queen           qui
shees          shoes           she’s / she is
weele          weal            we’ll / we will
where’s        whores          where’s / where is

Of course, you will want to check how your own VARD installation handles these words. VARD keeps a running list of the changes it makes, and that list silently trains the program as it runs, so it is good practice to examine what VARD changes certain words to. Your datasets will be different from mine: the corrections above are based on blatant errors the VEP team located in the early modern drama corpus, and datasets have different curation needs depending on their content.
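One way to examine those changes is to tally every replacement in a normalized file and flag the ones that match known bad pairs. The Python sketch below is only an illustration, not part of VARD: it assumes the output marks each change with a tag shaped like <normalised orig="...">...</normalised>, so adjust the pattern to whatever your installation actually writes. The function names, the sample string, and the short KNOWN_ERRORS list (a few rows from the table above) are hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: each change is tagged like
#   <normalised orig="hir" auto="true">his</normalised>
# Adjust this pattern to match your own VARD output.
TAG = re.compile(r'<normalised\s+orig="([^"]*)"[^>]*>([^<]*)</normalised>')

# (original, questionable normalization) -> preferred form.
# A few rows from the table above; extend with your own findings.
KNOWN_ERRORS = {
    ("hir", "his"): "her",
    ("ile", "isle"): "I'll",
    ("maister", "moister"): "master",
    ("shees", "shoes"): "she's",
}


def tally_replacements(text):
    """Count each (original, normalized) pair, lowercased."""
    return Counter((orig.lower(), norm.lower()) for orig, norm in TAG.findall(text))


def flag_known_errors(counts):
    """Map each known-bad pair found in counts to (count, preferred form)."""
    return {
        pair: (n, KNOWN_ERRORS[pair])
        for pair, n in counts.items()
        if pair in KNOWN_ERRORS
    }


# Hypothetical sample text, for illustration only.
sample = (
    '<normalised orig="Ile" auto="true">isle</normalised> tell '
    '<normalised orig="hir" auto="true">his</normalised> '
    '<normalised orig="maister" auto="true">master</normalised>'
)

counts = tally_replacements(sample)
flagged = flag_known_errors(counts)

for (orig, norm), (n, preferred) in flagged.items():
    print(f"{orig} -> {norm} ({n}x), should be {preferred}")
```

Sorting the full tally by count is also a quick way to spot new questionable replacements that are not yet on your list.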

*The dataset we use is interspersed with other languages, especially Latin. I had to add foreign words, like nunc and qui, to the dictionary to keep VARD from normalizing them and skewing the frequency of certain vocabulary.
