Forcing Standardization in VARD, Part 2

The final aspect of standardization I will discuss is the forcing of common early modern spellings to modern equivalents: decisions where the payoff of consistency outweighs slight data loss.

The VEP team decided to force bee > be, doe > do, and wee > we.

Naturally one can see the problems inherent to these forced standardizations.

Bee in early modern spelling can stand for the insect as well as the verb. Similarly with doe, it can signify a deer or a verb. For wee, it can be either an adjective or a pronoun. We hypothesized for our drama corpus that 1) bee would be overwhelmingly the verb; 2) doe would overwhelmingly be the verb; 3) wee would overwhelmingly be the pronoun.

The decision to force these words was supported by sampling frequency of meanings in the early modern drama corpus, along with frequencies from Anupam Basu’s EEBO-TCP Key Words in Context tool set to original spelling, offered by Early Modern Print: Text Mining Early Printed English.

My method to determine meaning frequency is as follows:

  1. I searched for the first 1,000 instances of a spelling in the early modern corpus and Key Words in Context.
  2. I generated CSVs of the 1,000 hits of the spellings in question, including surrounding text to gain context and determine the word’s signification.
  3. When I located a word that deviated from the meaning VEP projected the spelling would be associated with, I highlighted the entry and took notes in a column beside the line.
  4. After I read through 1,000 instances of the spelling, I tallied the number of times the word did not match our hypothesized meaning.
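
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the script or tool actually used: the function names, the five-word context window, and the empty "note" column for the reader's annotations are all hypothetical.

```python
import csv
import io
import re

def kwic_rows(text, spelling, window=5, limit=1000):
    """Collect up to `limit` keyword-in-context rows for one spelling."""
    tokens = re.findall(r"\S+", text)
    rows = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively after trimming surrounding punctuation.
        if token.strip(".,;:!?()[]\"'").lower() == spelling:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append([left, token, right, ""])  # last column: reader's note
            if len(rows) == limit:
                break
    return rows

def write_kwic_csv(rows, handle):
    """Write the rows with a header so they can be annotated in a spreadsheet."""
    writer = csv.writer(handle)
    writer.writerow(["left", "hit", "right", "note"])
    writer.writerows(rows)

sample = "Where the bee sucks, there suck I. It cannot bee that wee bee so deceived."
rows = kwic_rows(sample, "bee", window=3)
buffer = io.StringIO()
write_kwic_csv(rows, buffer)
print(buffer.getvalue())
```

Deciding whether each hit is the insect or the verb remains a reading task; the code only lines up the context.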

BEE > BE

CORPUS      INSTANCES OF INSECT   INSTANCES OF SPELLING   PERCENTAGE OF ERROR
EM Drama    17                    1,000                   1.7%
Key Words   71                    1,000                   7.1%

Bee as bee is higher in the first 1,000 hits of Key Words for generic reasons. Key Words contains all of EEBO-TCP. There are early dictionaries (Thomas Elyot) and husbandry texts (John Fitzherbert). Moreover, compilers like George Gascoigne recognized the metaphorical power of the bee’s work–travelling from flower to flower to make sweet honey–and used it as meta-commentary for their labor gathering the most delightful and edifying writing.

DOE > DO

CORPUS      INSTANCES OF ANIMAL   INSTANCES OF SPELLING   PERCENTAGE OF ERROR
EM Drama    0                     1,000                   0%
Key Words   1                     1,000                   0.1%

I further looked into variant spellings of the conjugation does in the drama corpus, to see how common the animal would be as opposed to the verb. Searching for does in the corpus yielded one instance of the animal in the first 1,000 instances of the spelling (0.1%). Searching doe’s in the corpus yielded 158 instances of the spelling, all of which were the verb.

The above results suggest minimal data loss for standardizing all instances of doe to do in the drama corpus.

WEE > WE
It is harder to pin down figures for this decision.

Searching for wee in the early modern drama corpus, I identified 4 of the first 1,000 instances that were not the pronoun. One looked like it should have been well; another looked like an elision of God be with yee (God b’wee). The remaining two instances were French, which, when standardized, should be oui.

Based on the first 1,000 instances of wee in Key Words in Context, there was too much noise. It seems that the text you search in Key Words in Context doesn’t preserve the TCP notation for illegible characters, the bullet (•). There were many places where I had to look at the original TCP files to determine the signification of wee because neither the pronoun we nor the adjective wee made sense. When consulting the files, I matched wee to words with illegible characters (e.g., we•e).

What do these standardizations mean for the drama corpus?
If you work on bee and deer imagery in early modern drama, you will want to look somewhere else. For the bee example, if the 17 in 1,000 instances of the spelling bee as insect holds steady over the 6,694 instances of bee in the drama corpus, that means ~114 of those 6,694 spellings of bee refer to the insect. Overall, with an error rate of 1.7%, data loss in the corpus is minimal when the spelling bee is forced to be.
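
The extrapolation is simple proportion, and it can be rerun for any of the three spellings. A minimal sketch, using only the figures reported in this post (the function name is illustrative):

```python
def projected_mismatches(sample_mismatches, sample_size, corpus_hits):
    """Scale the mismatch rate seen in a sample up to every hit in the corpus."""
    rate = sample_mismatches / sample_size
    return rate, rate * corpus_hits

# 17 insect senses in the first 1,000 hits; 6,694 hits of "bee" in the drama corpus.
rate, expected = projected_mismatches(17, 1000, 6694)
print(f"error rate {rate:.1%}, projected insect senses: {expected:.1f}")
```

The projection assumes the first 1,000 hits are representative of the rest, which, as noted below, is not guaranteed.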

Granted, I looked at the first 1,000 instances of spellings in the corpus and in Key Words. Consequently, I reviewed inconsistent portions of these corpora. The VEP team decided the sampling was telling for the context of the drama corpus. Another inconsistency with the files is the order in which they were searched between Key Words and the drama corpus. Key Words doesn’t provide the user with options for ordering the results; the words are displayed in chronological order. For the drama corpus, files were searched from smallest to largest TCP file number. Overall, the frequency of significations suggests small margins of error for the standardizations of bee, doe, and wee within the corpus.

Forcing Standardization in VARD, Part 1

Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered which words and patterns VARD should be changing but wasn’t. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving Early Modern verb endings (-st, -th) would require an overhaul of VARD’s dictionary.

There were three routes I followed to force standardization: manually selecting variants over others to change confidence scores; marking non-variants as variants and inputting their standardized form; adding words to the dictionary.

For the early modern drama corpus, the VEP team identified two grammatical features for forced standardization. We decided to implement consistent spelling for pronouns, adverbs, and prepositions, and to expand elisions that would interfere with algorithmic analysis, like topic modeling. Granted, more could have been changed, but we erred on the side of caution to see how effective the changes would be overall.

After documenting forced changes, I will discuss their implications for the dataset, which will come in the next entry.

RULES TO FORCE ELISION EXPANSION (read more here)
t’ to_ Start
th’ the_ Start

PRONOUNS AND CONTRACTIONS
hee > he
hir > her
ide > I’d
ile > I’ll
i’le > I’ll
she’s > she’s
shees > she’s
* wee > we

ADVERB CONTRACTIONS
heeres > here’s
heere’s > here’s
theres > there’s
ther’s > there’s
wheres > where’s
where’s > where’s

ADVERBS/PREPOSITIONS
aboue > above
ne’er > never
ne’re > never
nev’r > never
o’er > over
oe’r > over
ope > open
op’n > open

WORDS ADDED TO DICTIONARY: Cupid, Damon, Leander, Mathias, nunc, Paul’s, Piso, qui, quod, tis, twas, twere, twould

MARKED AS VARIANTS FOR CORRECTION: greene > green, lockes > locks, vs > us, wilde > wild

* I will discuss the implications of our decision for wee in the next entry.

VARD Normalization Errors

VARD decently standardizes Early Modern English. Sometimes, though, it makes questionable replacements.

ORIGINAL       NORMALIZATION   SHOULD BE
all’s          ell’s           all’s
caus’d         cause           caused
Cicilia        Cicely          Cicilia
courtesie      curtsy          courtesy
diuers         divers          diverse
hir            his             her
ile            isle            I’ll
ist            first           is’t
kild, killd    kilt            killed
maister        moister         master
maist          moist           mayst
*nunc          nuns            nunc
Pauls          Pals            Paul’s
*qui           queen           qui
shees          shoes           she’s / she is
weele          weal            we’ll / we will
where’s        whores          where’s / where is

Of course, you will want to check how your VARD installation handles these words. VARD keeps a running list of changes it makes, which silently trains the program as it executes. It is good practice to examine what VARD changes certain words to. Your datasets will be different than mine. These changes are based on blatant errors the VEP team located in the early modern drama corpus. Datasets, based on their content, have different curation needs.

*The dataset we use is interspersed with different languages, especially Latin. I had to add foreign words–like nunc and qui–to the dictionary to prevent skewing the frequency of certain vocabulary.

Tweaking VARD: Aggressive Rules for Early Modern English Morphemes and Elisions

Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus.

Given the size of Hope’s corpus, comparing VARD’s output to the original play files had to be automated. Erin Winter wrote a case-sensitive Python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking at only the highest frequencies. I looked at unique spellings changed within the frequency range of approximately 46,000 to 100 times, which amounted to nearly 3,000 cases. (There were approximately 58,000 unique spellings in the corpus changed 10 or fewer times.) To offer a glimpse, here are the 10 most frequent VARD normalizations for the early modern drama corpus:

ORIGINAL NORMALIZED FREQUENCY
haue have 45680
selfe self 18473
Ile Isle 16095
loue love 15666
thinke think 10450
mee me 10437
vpon upon 10287
owne own 10205
vp up 9704
’tis it is 9691
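
For readers who want to build a similar tally, here is a minimal sketch. It is not Erin Winter's script. It assumes that VARD's XML output wraps each change in a tag of the form <normalised orig="haue" auto="true">have</normalised>; check your own output files before relying on that pattern. The function names are illustrative.

```python
import csv
import re
import sys
from collections import Counter

# Assumption: each VARD change looks like
# <normalised orig="haue" auto="true">have</normalised>
NORMALISED = re.compile(
    r'<normalised\b[^>]*\borig="([^"]*)"[^>]*>(.*?)</normalised>',
    re.DOTALL,
)

def tally_changes(varded_texts):
    """Count (original, normalized) pairs across texts, case-sensitively."""
    counts = Counter()
    for text in varded_texts:
        for original, normalized in NORMALISED.findall(text):
            counts[(original, normalized)] += 1
    return counts

def write_tally(counts, handle):
    """Write the pairs as CSV rows, most frequent change first."""
    writer = csv.writer(handle)
    writer.writerow(["original", "normalized", "frequency"])
    for (original, normalized), n in counts.most_common():
        writer.writerow([original, normalized, n])

sample = (
    'I <normalised orig="haue" auto="true">have</normalised> seen it; '
    '<normalised orig="Ile" auto="true">Isle</normalised> say I '
    '<normalised orig="haue" auto="true">have</normalised>.'
)
counts = tally_changes([sample])
write_tally(counts, sys.stdout)
```

Sorting by frequency is what makes the review tractable: the suspicious high-frequency pairs surface first.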

The CSV tracking normalizations proved a painless way to identify where VARD needed a gentle push in another direction. Note Ile in the above table. Yes, England is an island (of which writers were aware), but 16,095 changes to Isle seemed suspect. When I looked at files with VARD-inserted XML tags, it became obvious those Iles should have been standardized to I’lls. There, VARD was simply wrong. (I will devote the next post to where VARD goofs–sometimes amusingly–in standardization.)

By researching questionable corrections, I was able to formulate standardization rules more “aggressive” than those the program ships with. (You can locate the default rules in the file “rules.txt,” in VARD’s “training” folder.) These rules dictate modern letter substitutions for common early modern letter combinations. Examples of the rules are as follows:

CHARACTERS   CHANGE TO   LOCATION IN WORD
vv           w           Anywhere
ie           y           Anywhere

Given the above rules, when VARD processes the word alvvaies, the program may suggest multiple variants: alwaies and alvvays. This produces competing spellings for the same word across standardized documents, a problem that proliferates when VARD handles early modern prepositions and adverbs, even words with apostrophes (e.g., ne’er, ne’re, and nev’r normalize differently; should the apostrophe be eliminated or maintained?).
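
To see why several candidates appear, here is a minimal sketch of position-aware letter substitution. It models the rules as (characters, change to, location) triples, as in the table above. It is an illustration of the idea, not VARD's implementation; the function names and the three-round limit are assumptions.

```python
RULES = [
    ("vv", "w", "Anywhere"),
    ("ie", "y", "Anywhere"),
]

def apply_rule(word, old, new, where):
    """Return every spelling produced by applying one rule at one position."""
    results = set()
    if where == "Start":
        if word.startswith(old):
            results.add(new + word[len(old):])
    elif where == "End":
        if word.endswith(old):
            results.add(word[:-len(old)] + new)
    else:  # "Anywhere": try each occurrence separately
        start = word.find(old)
        while start != -1:
            results.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return results

def candidates(word, rules, max_rounds=3):
    """Collect the spellings reachable by applying rules one after another."""
    seen = {word}
    frontier = {word}
    for _ in range(max_rounds):
        reached = set()
        for current in frontier:
            for old, new, where in rules:
                reached |= apply_rule(current, old, new, where)
        reached -= seen
        if not reached:
            break
        seen |= reached
        frontier = reached
    return seen - {word}

print(sorted(candidates("alvvaies", RULES)))
print(sorted(candidates("musick", [("ck", "c", "End")])))
```

One rule alone yields alwaies or alvvays; only the two rules together reach always. Which candidate wins is then a matter of dictionary lookup and confidence scoring.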

My additions to “rules.txt” aided not only spelling standardization but expanding elisions. The rules mainly gave VARD an extra push to handle early modern English morphemes. While “rules.txt” contains the rule ie at the end of words can be changed to y, it didn’t have a rule to help with standardizing the common adverb ending lie. Here is a table of the rules I added:

CHARACTERS   CHANGE TO   LOCATION IN WORD
cyon         tion        End
lie          ly          End
shyp         ship        End
t’           *to_        Start
th’          *the_       Start
tiue         tive        End
vn           un          Start
vs           us          Anywhere
ynge         ing         End

While not comprehensive, the rules definitely aided VARD’s efforts. Of course, entering rules is only one step of the process. For the rules you add to the dictionary, you must manually train VARD to implement them.

* A final word regarding the entries I made to expand the elisions t’ and th’ when they begin words. In the table above, I typed an underscore (_) to show that there is a space after to and the in the rules. VARD recognizes spaces in rule input. In the GUI the rule will be displayed with an underscore; you do not type the underscores in. The rules worked, and the program properly expanded words after some manual training. It changed th’ambassador to the ambassador, t’change to to change.
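
Outside VARD, the same two expansions can be approximated with a regular expression. This is a rough sketch for checking results, not a reproduction of VARD's rule engine; it accepts both straight and curly apostrophes, and the names are illustrative.

```python
import re

# t' or th' at the start of a word, followed directly by another letter.
ELISION = re.compile(r"\b(th|t)['’](?=\w)", re.IGNORECASE)
EXPANSIONS = {"t": "to", "th": "the"}

def expand_elisions(text):
    """Expand word-initial t' and th' into "to " and "the "."""
    def replace(match):
        head = match.group(1)
        full = EXPANSIONS[head.lower()]
        if head[0].isupper():
            full = full.capitalize()
        return full + " "
    return ELISION.sub(replace, text)

print(expand_elisions("th’ambassador came t’change the terms"))
```

A blanket pattern like this will also fire where th’ stands for something other than the (th’art for thou art, for instance), which is one reason manual review and training still matter.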

VARD & ASCII Symbols

Yes, even ASCII symbols mess up VARD.

Those who have tried to extract plain text from TCP TEI P4 or P5 XML files know how difficult it is. While coding tools to extract TCP text, the VEP team grappled with the order of operations to perform. Where is the best place in an extraction pipeline to convert the XML document to text? Where do we want to use VARD?

As discussed in my previous post, processing XML files through VARD can be tricky. Non-ASCII symbols and XML tags interrupt the words that VARD needs to check against its dictionary, preventing VARD from recognizing words in their entirety.

For the most part, VARD cannot process even ASCII symbols as part of words, which has implications for extracting and representing TCP XML files. In order to process TCP XML, the VEP team has had to construct its character cleaner and text extractor to work with VARD’s constraints regarding symbols and XML tags. Furthermore, character cleaning and text extraction had to align with editorial principles. To illustrate, the team had to consider the extent to which its algorithms modified TCP text. TCP XML file structure and contents further complicated the modification. When extracting text, did we only want to extract what was definite (the characters) or also preserve the traces of illegibility (characters represented by symbols)?

In the end, VEP decided to design character cleaning and text extraction tools that preserve textual information. It required figuring out character substitutions that worked with VARD to account for symbols nested within words. If a word contained illegible characters, the number of illegible characters would be maintained. However, the TCP’s bullet point that represents illegible characters doesn’t allow VARD to read the surrounding characters as one word.

To address the dilemma, I generated a test text file with a word that had symbols interrupting it, quite like you will find in TCP corpora. I recreated the test for the post today, using the word unworthinesse. I wanted to see which ASCII symbols VARD would treat as part of words. As you can see in the screen capture to the left of VARD’s GUI, VARD successfully treats several ASCII characters as part of words–the entire word is highlighted. For the symbols not treated as part of the word, VARD doesn’t highlight them. Unsurprisingly, VARD treats hyphens (-) as part of a word. Hyphens are a common feature of compound adjectives. Other ASCII symbols VARD recognizes are the tilde (~), the caret (^), and the equals sign (=).
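
The effect can be imitated with a toy tokenizer. The sketch below is only an approximation of the behaviour described here, not VARD's own tokenizer: it assumes a "word" is a run of ASCII letters plus the four symbols named above, and that everything else splits the word.

```python
import re

# Assumption: letters plus hyphen, tilde, caret, and equals sign count as word-internal.
WORD = re.compile(r"[A-Za-z\-~^=]+")

def word_chunks(text):
    """Return the runs of characters this toy tokenizer treats as single words."""
    return WORD.findall(text)

for sample in ["unwor-thinesse", "unwor^thinesse", "unwor•thinesse", "wor∣thinesse"]:
    print(sample, "->", word_chunks(sample))
```

A word interrupted by a bullet or an end-of-line marker comes back as two chunks, each of which would be checked against the dictionary on its own.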

When designing the character cleaner for TCP corpora, the VEP team leveraged the knowledge of how VARD handles ASCII symbols in the following way:

  1. Illegible characters (bullet: •) replaced by the caret (^). TCP: we•e | VEP: we^e
  2. Unrecognized punctuation (small black square: ▪) replaced by the asterisk (*). TCP: long ago▪ | VEP: long ago*
  3. Unrecognized characters and common textual symbols (e.g., the pilcrow (¶)) replaced by at sign (@). TCP: ¶Behold, | VEP unstripped text: @Behold, | VEP stripped text: Behold,
  4. Missing words (lozenge in angle brackets: 〈◊〉) replaced by ellipses in parentheses ((…)).
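
The substitution scheme in the list above can be sketched as follows. This is a minimal illustration, not the VEP character cleaner itself: the function name and the strip_symbols option are hypothetical, and only the pilcrow stands in for the wider class of "unrecognized characters and common textual symbols."

```python
# One-to-one substitutions taken from the list above.
CHAR_MAP = str.maketrans({
    "•": "^",  # illegible character
    "▪": "*",  # unrecognized punctuation
    "¶": "@",  # pilcrow, as one example of a common textual symbol
})
MISSING_WORD = "〈◊〉"  # lozenge in angle brackets

def clean_characters(text, strip_symbols=False):
    """Swap TCP transcription symbols for ASCII stand-ins VARD can live with."""
    text = text.replace(MISSING_WORD, "(…)")
    text = text.translate(CHAR_MAP)
    if strip_symbols:
        # Hypothetical "stripped text" mode: drop the @ placeholders entirely.
        text = text.replace("@", "")
    return text

print(clean_characters("we•e long ago▪ ¶Behold,"))
print(clean_characters("¶Behold,", strip_symbols=True))
```

Because each illegible character maps to exactly one caret, the count of illegible characters in a word survives the cleaning.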

With the above scheme we preserve as much textual information as possible. With caret replacements, VARD has the opportunity to standardize words that have illegible characters.

Future versions of our character cleaner may take advantage of the tilde (~) to help represent letters with macrons (ā to a~).

Our character cleaner also removes certain XML tags to give the flexibility of using VARD on TCP files in text or XML format.

  1. <SEG> tags of decorative initials: <SEG REND="decorInit">T</SEG>he
  2. Superscript: 13<sup>th</sup>
  3. Subscript: X<sub>2</sub>
  4. XML comments: <!-- handkeyed by person -->
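
A rough sketch of this kind of tag removal with regular expressions follows. It assumes the text inside the tags is kept and only the markup is dropped, and it matches only the exact tag shapes shown in the list; the real cleaner may handle more variation. The names are illustrative.

```python
import re

# (pattern, replacement) pairs; the enclosed text is kept, the markup is dropped.
TAG_RULES = [
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),
    (re.compile(r'<SEG\s+REND="decorInit"\s*>(.*?)</SEG>', re.DOTALL), r"\1"),
    (re.compile(r"<(sup|sub)>(.*?)</\1>", re.DOTALL), r"\2"),
]

def strip_inline_tags(text):
    """Remove decorative-initial, superscript, subscript, and comment markup."""
    for pattern, replacement in TAG_RULES:
        text = pattern.sub(replacement, text)
    return text

print(strip_inline_tags('<SEG REND="decorInit">T</SEG>he 13<sup>th</sup> of X<sub>2</sub>'))
```

Removing the decorative-initial tag matters most for VARD, since it rejoins the initial with the rest of its word.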

Of course, a final caution for VARDing XML files: make sure the program processes only the text that you want it to. VARD automatically ignores XML tags. It’s still going to alter what is between those tags, especially in the HEADER of the XML file, which contains the metadata. To make sure VARD doesn’t change the metadata, add the following entries to VARD’s “text_to_ignore.txt” file in the setup folder (it contains the code for ignoring XML tags):

  1. (?s)<HEADER>.*</HEADER>
  2. (?s)<header>.*</header>
  3. (?s)<teiHEADER>.*</teiHEADER>
  4. (?s)<TEIHEADER>.*</TEIHEADER>
  5. (?s)<TEMPHEAD>.*</TEMPHEAD>

Why are there so many? Because coding practices are incredibly variable.
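
Before trusting the patterns, it can help to test what they match on a sample file. The sketch below uses Python's re module purely as a stand-in to illustrate the patterns (VARD itself is a Java program), and the sample document is invented.

```python
import re

IGNORE_PATTERNS = [
    r"(?s)<HEADER>.*</HEADER>",
    r"(?s)<header>.*</header>",
    r"(?s)<teiHEADER>.*</teiHEADER>",
    r"(?s)<TEIHEADER>.*</TEIHEADER>",
    r"(?s)<TEMPHEAD>.*</TEMPHEAD>",
]

def ignored_spans(xml):
    """Return the (start, end) spans of the document covered by any pattern."""
    spans = []
    for pattern in IGNORE_PATTERNS:
        spans.extend(match.span() for match in re.finditer(pattern, xml))
    return sorted(spans)

doc = (
    "<TEI><HEADER><TITLE>The Tragedie\nof Hamlet</TITLE></HEADER>"
    "<TEXT>To bee or not to bee</TEXT></TEI>"
)
for start, end in ignored_spans(doc):
    print(doc[start:end])
```

The (?s) flag lets the dot match line breaks, so a header spread over many lines is still caught as one block. Note that the patterns are case-sensitive, which is why each spelling of the tag needs its own entry.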

VARD & Character Encoding

Everything is always already encoded.

The first time I used VARD, discussed in my previous entry, it was a shiny toy, one with which I wanted to automatically process batches of TCP TEI P4 XML files. That was in February of this year. Since then, interactions I have had with VARD underscore the need to understand how the tools I use work.

The public releases of ECCO-TCP, EEBO-TCP, and Evans-TCP texts are a boon to scholars, who can use these texts as the basis for their scholarship. Those who wish to create digital editions and computationally analyze TCP texts may turn to programs like VARD to assist with crucial file pre-processing, like spelling standardization. Given the use of unsupervised or semi-supervised curation methods, those who work with TCP texts must be transparent about how they access, process, and analyze their data sets. The decisions involved in data curation impact interpretations that can be drawn from TCP texts. This transparency demands responsibility on the part of the scholar to know how their methods and tools manipulate the data within TCP XML files.

The purpose of this entry is twofold. It is to make transparent decisions made by the VEP team to process TCP XML files. More importantly, it is to highlight how VARD has interacted with TCP files, providing a resource for scholars and the curious working with them. VARD’s default behavior has serious implications for the spelling standardization of TCP XML files.

(For those who wish to use text extracted from TCP XML files, the following is just as relevant.)

If you use VARD on TCP XML files, you must be extra careful with non-ASCII characters and XML tags. Even ASCII symbols create problems. TCP XML files contain a plethora of symbols, traces of their transcription. The symbols not only capture textual elements, like foreign alphabets and diacritical marks, but transcribers’ experiences with texts. There are symbols to communicate illegible characters (•), ambiguous dot-like punctuation (▪), end-of-line hyphens (∣), even end-of-line hyphens inserted by editors (¦). In the TEI P4 versions of the XML files, many of the symbols are inserted right into the body of the text. In the TEI P5 versions, however, XML tags take their place. (For the interested, here is the TCP character entity list.)

TEI P4: gentle Rea∣der
TEI P5: gentle Rea<g ref="char:EOLhyphen"/>der

Why is it such a problem? VARD doesn’t handle non-ASCII symbols, and TCP transcription frequently includes them within words. When XML tags and symbols occur within a word, VARD processes the characters on each side of the tags/symbols separately.

The image to the left demonstrates how VARD would automatically process wor|thinesse in both P4 and P5. It leaves wor alone, but suggests chinese as a replacement for thinesse. (The picture also demonstrates how TEI P4 and P5 are different beasts.)

Symbols and XML tags are ubiquitous in TEI P4 and P5 versions of TCP XML files. They are integral to recreating early modern texts, rife with diacritical marks and foreign alphabets. To drive the point home, any letters with accents and macrons will not be recognized in VARD’s modern English dictionary. VARD will simply read characters on both sides of letters with diacritical marks as different words. XML tags create similar situations, just as diverse as non-ASCII symbols. They contain information from decorative initials to notes in the margin and the number of illegible characters.

Before you process TCP texts with any tool, know the encoded contents of those texts. Be sure to check how processing alters the files. You might be surprised.

The next entry will discuss how the VEP team prepares TCP XML files for VARD, leveraging ASCII symbols during character cleanup and XML extraction.

Standardizing Early Modern Drama

We have made great progress with Jonathan Hope’s early modern drama corpus. It now includes plays dated up through 1700, built from TCP corpora. By my count, it comprises 1,257 plays. A corpus of this size and origin requires considerable curation. Beth Ralston has spearheaded metadata collection and cross-referencing–quite the feat–from Glasgow. In Madison, the VEP team has worked on extracting necessary text from TCP XML files. This effort involved writing and tweaking Python scripts specifically for TEI P4 versions of the TCP offerings. Additionally, the team has consulted with Hope to attempt standardizing the corpus’s unwieldy early modern orthography.

To standardize the drama corpus, we are using VARD, a tool that aids spelling standardization within historical corpora. While VARD’s default is made to process Early Modern English texts, it achieves standardization by checking words against a modern dictionary. Therefore, using VARD necessitates modernization.

Though not an exhaustive list, modernization in VARD entails: changing Early Modern English second and third person verb endings to modern ones (-eth to -s); expanding elisions (o’er to over); changing variants of a word to a preferred form of that word (ope and op’n to open); joining separate words into a modern equivalent (him selfe into himself).

VARD standardizes 1-grams, that is, it evaluates one word at a time. Part of speech isn’t taken into account. VARD assesses a 1-gram against its dictionary (“words.txt”) and determines whether the 1-gram’s spelling is a non-variant or a variant. If the spelling matches a word in the dictionary, the 1-gram is marked as a non-variant and left alone. When a 1-gram’s spelling varies from what is in the dictionary, it is marked as a variant. VARD contains rules that manage how a variant is modernized, mainly a text file of modern spelling substitutions for early modern ones (“rules.txt”). This works at the level of the letter. For example, when VARD processes the word musick, a rule in the dictionary indicates that CK at the end of words can be replaced with just a C, enabling VARD to change musick to music. The rules allow for multiple variations based on a variant spelling to occur.

To decide which replacement a variant spelling will be changed into, VARD calculates a confidence score (an f-score) that weights how probable each candidate replacement is. VARD keeps track of how many times it replaces words with specific variants. The f-score combines the precision of a candidate’s spelling with its recall (a percentage based on how many times that candidate has replaced the marked variant before). VARD’s normalization threshold uses the confidence score to determine which candidate, if any, replaces a marked variant. The threshold is 50% by default: if a candidate’s confidence score is above 50%, it will replace the early modern word in question. F-scores can be weighted to consider precision and recall equally, or to favor one over the other. In other words, VARD decides how to replace words according to what may be most correct or most probable based on how many times words occur within a corpus.
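
For readers unfamiliar with f-scores, here is a minimal sketch of the standard weighted F-measure and a threshold check. It illustrates the general formula only; how VARD derives precision and recall internally is not covered in this post, and the precision and recall figures in the example are invented for demonstration.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted F-measure: beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def choose_replacement(candidates, threshold=0.5, beta=1.0):
    """Pick the best-scoring candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    scored = {word: f_score(p, r, beta) for word, (p, r) in candidates.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] > threshold else None

# Invented (precision, recall) pairs for two candidate replacements of "Ile".
ile = {"I'll": (0.9, 0.4), "Isle": (0.6, 0.7)}
print(choose_replacement(ile))            # equal weighting
print(choose_replacement(ile, beta=0.5))  # weighting that favors precision
```

With these made-up figures the two weightings pick different winners, which shows why the weighting and the threshold are worth inspecting before batch processing.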

Through a GUI, VARD allows users to manually correct texts one at a time or batch process based on the normalization threshold. Users can also process texts through the command line. Standardizing the Early Modern drama corpus has required both.

Quite a few posts re: the VARD saga will follow this one, as there are many aspects to explain about our curation experiences. A preview of post topics follows below:

  • Encoding issues before the TEI P4 XML files could be processed by VARD
  • Standardization principles across the corpus (elisions, pronouns, contractions)
  • Counteracting VARD’s questionable, if not amusing, word replacements
  • Generating aggressive spelling standardization rules

Following posts will contain documentation for the early modern drama corpus and tips/tricks for VARD.