We have made great progress with Jonathan Hope’s early modern drama corpus. It now includes plays dated up through 1700, built from TCP corpora. By my count, it comprises 1,257 plays. A corpus of this size and origin requires considerable curation. Beth Ralston has spearheaded metadata collection and cross-referencing, quite the feat, from Glasgow. In Madison, the VEP team has worked on extracting the necessary text from TCP XML files. This effort involved writing and tweaking Python scripts specifically for the TEI P4 versions of the TCP offerings. Additionally, the team has consulted with Hope on standardizing the corpus’s unwieldy early modern orthography.
To standardize the drama corpus, we are using VARD, a tool that aids spelling standardization within historical corpora. While VARD’s default setup is designed for Early Modern English texts, it achieves standardization by checking words against a modern dictionary. Using VARD therefore means modernizing.
Though not an exhaustive list, modernization in VARD entails: changing Early Modern English second and third person verb endings to modern ones (-eth to -s); expanding elisions (o’er to over); changing variants of a word to a preferred form of that word (ope and op’n to open); joining separate words into a modern equivalent (him selfe into himselfe).
VARD standardizes 1-grams; that is, it evaluates one word at a time. Part of speech isn’t taken into account. VARD assesses a 1-gram against its dictionary (“words.txt”) and determines whether the 1-gram’s spelling is a non-variant or a variant. If the spelling matches a word in the dictionary, the 1-gram is marked as a non-variant and left alone. When a 1-gram’s spelling is not in the dictionary, it is marked as a variant. VARD contains rules that manage how a variant is modernized, mainly a text file of modern spelling substitutions for early modern ones (“rules.txt”). These rules work at the level of the letter. For example, when VARD processes the word musick, a rule indicates that CK at the end of a word can be replaced with just a C, enabling VARD to change musick to music. Because several rules may apply to the same variant spelling, the rules can produce more than one candidate replacement.
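The dictionary check and letter-level rewriting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not VARD’s own code: the tiny word set stands in for “words.txt”, the rule list stands in for “rules.txt”, and the function names are made up for this post. Only the CK-to-C rule comes from the example above; the other rule is an assumption added to show that a rule is kept only when its result is a dictionary word.

```python
import re

# Hypothetical stand-in for VARD's modern dictionary ("words.txt").
MODERN_WORDS = {"music", "open", "over", "himself"}

# Hypothetical letter-level rules in the spirit of "rules.txt":
# (regex pattern, replacement). Only the CK -> C rule is from the post.
LETTER_RULES = [
    (r"ck$", "c"),  # musick -> music
    (r"e$", ""),    # assumed rule: drop a final -e
]


def is_variant(word):
    """A 1-gram is a variant if its spelling is not in the dictionary."""
    return word.lower() not in MODERN_WORDS


def candidate_replacements(word):
    """Apply each rule once and keep only results found in the dictionary."""
    lower = word.lower()
    candidates = set()
    for pattern, replacement in LETTER_RULES:
        rewritten = re.sub(pattern, replacement, lower)
        if rewritten != lower and rewritten in MODERN_WORDS:
            candidates.add(rewritten)
    return sorted(candidates)


if __name__ == "__main__":
    for token in ["music", "musick", "himselfe"]:
        print(token, is_variant(token), candidate_replacements(token))
```

This sketch applies one rule at a time, so a spelling that needs two rewrites in sequence would yield no candidates here.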
To decide which replacement a variant spelling will be changed into, VARD performs word frequency calculations (f-scores). The calculation is a confidence score that weights how probable a candidate replacement is. VARD keeps track of how many times it replaces words with specific replacements. The f-score combines the precision of a replacement’s spelling with its recall (a percentage based on how many times that specific replacement has been used for the marked variant before). VARD has a normalization threshold that uses the confidence score to determine which candidate replaces a marked variant. That threshold is 50% by default. If a candidate’s confidence score is above 50%, it will replace the early modern word in question. F-scores can be weighted to consider precision and recall equally, or to favor one over the other. In other words, VARD decides how to replace words according to what may be most correct or most probable, based on how many times words occur within a corpus.
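For readers who want the arithmetic, here is a rough sketch of a weighted f-score and a threshold check. It uses the standard F-measure formula, F = (1 + β²)·P·R / (β²·P + R), where β = 1 weights precision and recall equally, β below 1 favors precision, and β above 1 favors recall. How VARD actually derives precision and recall for a candidate is not shown here; the function names and the sample numbers are invented for illustration, and only the 50% default threshold comes from the description above.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard weighted F-measure; beta=1 weights both terms equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_replacement(candidates, threshold=0.5, beta=1.0):
    """Return the best-scoring candidate if it clears the threshold.

    `candidates` maps a candidate spelling to a (precision, recall) pair.
    These pairs are made-up inputs, not values produced by VARD.
    """
    scored = {word: f_beta(p, r, beta) for word, (p, r) in candidates.items()}
    best = max(scored, key=scored.get, default=None)
    if best is not None and scored[best] > threshold:
        return best
    return None  # leave the variant for manual correction


if __name__ == "__main__":
    invented = {"music": (0.9, 0.8), "musics": (0.2, 0.1)}
    print(pick_replacement(invented))
```

Returning nothing when no candidate clears the threshold mirrors the idea that low-confidence variants are left for a human to correct by hand.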
Through a GUI, VARD allows users to manually correct texts one at a time or to batch process them based on the normalization threshold. Users can also process texts through the command line. Standardizing the early modern drama corpus has required both.
Quite a few posts re: the VARD saga will follow this one, as there are many aspects of our curation experiences to explain. A preview of post topics follows below:
- Encoding issues before the TEI P4 XML files could be processed by VARD
- Standardization principles across the corpus (elisions, pronouns, contractions)
- Counteracting VARD’s questionable, if not amusing, word replacements
- Generating aggressive spelling standardization rules
Following posts will contain documentation for the early modern drama corpus and tips/tricks for VARD.