Visualizing English Print’s Text Processing Pipeline Version 2.0
written by: Deidre Stuffer
To make the work of the Text Creation Partnership (TCP) more suitable for scalable computationally-driven analysis, VEP has designed a text processing pipeline that includes three important steps in the following order: character cleaning, text extraction, and spelling standardization. We have designed the character cleaning and text extraction processes to facilitate spelling standardization based on the composition of our source files. (Note: this page describes pipeline version 2.0. You can read about pipeline 1.0’s workflow here.)
Read about our new text processing pipeline below:
- What has changed between version 1.0 and 2.0?
- Character Cleaning
- Text Extraction
- Spelling Standardization
1. What has changed between pipeline 1.0 and 2.0?
The Visualizing English Print team decided not to use VARD to standardize spelling across the TCP texts for pipeline 2.0, as it had for pipeline 1.0. This decision does not discredit VARD, which works well on smaller, highly curated corpora. While we noted that VARD performed well on standardizing the spelling in the VEP Core Drama 1660 corpus, training VARD to handle the heterogeneity and scale of the TCP (~61,000 texts of diverse subject matter) would have been a painstaking and time-consuming process.
Instead of using VARD, VEP decided to generate a dictionary of spelling standardizations to avoid the time it would have taken to train VARD and verify its corrections at the scale of the TCP. However, VARD was invaluable to this process: the VEP team evaluated VARD’s most frequent corrections from the previous version of the corpus to serve as the base of the dictionary. Our dictionary’s strategy is to supply a standard spelling for the original spellings found in the TCP documents. The strategy makes direct replacements for available spellings, whereas VARD statistically evaluates and weights the potential spellings that a word in question can have.
Furthermore, VARD uses a British dictionary for spelling standardization. It is therefore difficult to use VARD with most natural language processing (NLP) tools, as they tend to be programmed to recognize Standard American English, not British Standard English. A major goal of the VEP project is to produce machine-actionable corpora. Standardizing spelling variation in the TCP according to American English conventions enables NLP software to more efficiently analyze content in TCP texts.
VEP strives to make machine-actionable texts that preserve aspects of Early Modern printed English. The aspects we privilege are based on our collaborators’ needs. Our dictionary preserves Early Modern English second- and third-person singular verb endings, unlike VARD. VARD converts these endings to modern English equivalents, removing second-person endings and replacing third-person endings ((e)th > s). It would have been time-consuming to train VARD not only to recognize the words with these endings but to use a single standardization for the different spellings of a verb. To a point, our dictionary allows users to retain these archaic verb endings or replace them. You can learn more about our approach to spelling standardization and its limitations in the ‘Spelling Standardization’ portion of this document.
2. Character Cleaning
The TCP provides richly annotated XML/SGML editions of digitized microfilm images of early English books. To encode the symbols encountered and how print is formatted on the pages of the digital images, the TCP transcriptions rely on a mixture of TEI markup and UTF-8 symbols. Character cleaning is an important first step to support text extraction and spelling standardization, as it reduces the variety of UTF-8 symbols and provides ASCII equivalents where possible. Our SimpleText format sacrifices the richness of TCP’s transcriptions to support algorithmic processing of texts’ language.
The script executes in two phases. The first standardizes UTF-8 characters to ASCII equivalents and assigned substitutions. It additionally removes specified XML tags and replaces them with assigned ASCII characters. The second phase strips the text of the pipes that the TCP uses for end-of-line hyphens and other specified output like unrecognized characters. Our goal is to maintain textual features that impact the representation of words in files, such as illegible characters and gaps of missing text.
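The two phases can be sketched roughly as follows. This is a minimal illustration rather than VEP's actual script: the substitution table, the `GAP` tag pattern, and the `^` placeholder are all assumptions chosen for the example.

```python
import re

# Hypothetical substitution table: a few UTF-8 characters mapped to ASCII.
CHAR_MAP = str.maketrans({
    "\u017f": "s",   # long s
    "\u00e6": "ae",  # ae ligature
    "\u2019": "'",   # right single quotation mark
})

# Hypothetical tag rule: swap a gap tag for an ASCII placeholder
# so that missing or illegible text stays visible in the output.
GAP_TAG = re.compile(r"<GAP\b[^>]*/?>")
GAP_PLACEHOLDER = "^"


def phase_one(text: str) -> str:
    """Standardize characters, then replace the specified tags."""
    text = text.translate(CHAR_MAP)
    return GAP_TAG.sub(GAP_PLACEHOLDER, text)


def phase_two(text: str) -> str:
    """Strip the pipes that mark end-of-line hyphens."""
    return text.replace("|", "")


def clean(text: str) -> str:
    """Run both phases in order."""
    return phase_two(phase_one(text))


sample = "the ble\u017f|sed C\u00e6sar's h<GAP DESC=\"illegible\"/>rt"
print(clean(sample))
```

Keeping the placeholder in the output, rather than deleting the tag outright, is one way to preserve the fact that a word is incomplete.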
3. Text Extraction
VEP’s text extraction script takes advantage of the flexibility afforded by TEI-compliant markup. The script utilizes a configuration file that directs whether to extract or ignore text between certain XML tags, such as character names contained in SPEAKER tags. Additionally, the configuration file indicates how to format text when generating plain text file representations from the TCP XML. Users can therefore extract the specific text that they want from the XML source files by modifying the configuration file.
VEP deploys two text extraction configurations, based on the needs of our collaborators. The default configuration extracts all text from XML documents, while the dramatic configuration extracts only the text that is meant to be spoken on the stage. The dramatic configuration is used on the Early Modern Drama collection.
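A configuration-driven extractor of this kind might look like the sketch below. The configuration format (a set of tag names to ignore) and the choice of `SPEAKER`, `STAGE`, and `HEAD` for the drama configuration are assumptions made for illustration; VEP's real configuration files also carry formatting directions that are not modeled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configurations: tags listed under "ignore" are skipped
# together with everything they contain.
DEFAULT_CONFIG = {"ignore": set()}
DRAMA_CONFIG = {"ignore": {"SPEAKER", "STAGE", "HEAD"}}


def extract(element, config):
    """Collect text recursively, skipping ignored elements."""
    if element.tag in config["ignore"]:
        return []
    parts = []
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        parts.extend(extract(child, config))
        # Tail text belongs to the parent, so keep it even when the child is ignored.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


xml = (
    "<SP><SPEAKER>Ham.</SPEAKER>"
    "<L>To be, or not to be,</L>"
    "<STAGE>Aside</STAGE>"
    "<L>that is the question.</L></SP>"
)
root = ET.fromstring(xml)
print(" ".join(extract(root, DEFAULT_CONFIG)))
print(" ".join(extract(root, DRAMA_CONFIG)))
```

With the default configuration the speaker name and stage direction appear in the output; with the drama configuration only the two spoken lines remain.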
The default configuration file was made by surveying all XML tag elements and their attributes within the TEI P4 versions of EEBO-TCP Phase I, EEBO-TCP Phase II, ECCO-TCP, and Evans-TCP. The result is a list of XML element tags with their attributes, listed alphabetically, to which we added directions on what to print and how to print it. The drama configuration file is a more restricted version of the default configuration. The script also fixes punctuation for spelling standardization: it adds spaces before and after the two hyphens (--) that represent the dash so that the words on both sides can be recognized and evaluated.
1. Content within paragraph tags is followed by the space of two new lines.
2. Content within head and line tags is followed by one new line, leaving no space between it and the following content.
1. Speech is represented by lines, with space separating the speech of one character from another.
The format of the text files we release reflects the structure of their XML source files, not the printed text in the digitized microfilm images that served as the basis of the TCP transcriptions. While our text files are meant to be machine readable above all, printing lines of 80 characters helps human readers when they need to reference the texts.
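Two of the formatting conventions described in this section, spacing around dashes and 80-character lines, can be sketched as below. The function names are hypothetical, and the sketch assumes the dash is written as two ASCII hyphens; it is not VEP's extraction script.

```python
import re
import textwrap


def space_dashes(text: str) -> str:
    """Put one space on each side of a double-hyphen dash."""
    return re.sub(r"\s*--\s*", " -- ", text)


def format_paragraph(text: str, width: int = 80) -> str:
    """Wrap a paragraph to the given width and end it with two newlines."""
    return textwrap.fill(space_dashes(text), width=width) + "\n\n"


print(space_dashes("I came--I saw"))
print(format_paragraph("word " * 40), end="")
```

Separating the dash from its neighbors means a later dictionary lookup sees "came" and "I" as two words instead of the single unknown token "came--I".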
4. Spelling Standardization
Spelling standardization is the process of mapping spelling variants that share a meaning to a single word token. For example, the word we know as “never” can be spelled in any of the following ways in the TCP XML files: never, neuer, ne’re, ne’er, ne’r. Standardization ensures that all identified variant spellings become “never.” This standardization process often improves the accuracy of statistical analyses and reduces noise within the TCP as a dataset.
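The effect on word counts can be seen in a small sketch. The variant list comes from the example above, written with ASCII apostrophes on the assumption that character cleaning has already run; the sample sentence is invented.

```python
from collections import Counter

# Variants of "never" named in the text, mapped to one standard form.
NEVER_VARIANTS = {
    "neuer": "never",
    "ne're": "never",
    "ne'er": "never",
    "ne'r": "never",
}

tokens = "neuer say ne'er and never say ne'r".split()

before = Counter(tokens)
after = Counter(NEVER_VARIANTS.get(token, token) for token in tokens)

print(before["never"], after["never"])
```

Before standardization the four spellings are counted as four different words; afterwards they are counted together under "never".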
4.a SPELLING STANDARDIZATION PRINCIPLES
Collaborator Jonathan Hope contributed the following spelling standardization principles for VEP’s text processing pipeline.
- Standardize spellings to Standard American English equivalents (-our > -or; -re > -er; -ce > -se)
- Force Early Modern English orthographical conventions to Present Day English equivalents (vv > w; v > u; u > v; j > i)
- Expand elided adverbs (oft > often; ne’er > never)
- Expand elided prepositions (vpn > upon)
- Expand elided conjunctions (when’er > whenever)
- Expand pronoun contractions (I’le > I will; shee’s > she is; tis > it is)
- Expand elisions (pow’ring > powering; th’eagle > the eagle)
- Expand contractions (can’t > cannot; won’t > will not)
- Flag archaic second and third person singular verb endings (-(e)st & -(e)th) to allow script users to print either standardized archaic verbs or standardized Present Day English verbs
- Generate substitution rules to account for original spellings that needed to be combined or separated for standardization purposes (be gan > began; New-England > New England)
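The flagging principle for archaic verb endings could be represented as in the sketch below. The `Rule` structure, the `keep_archaic` switch, and the sample entries are hypothetical; they illustrate the idea of storing two standardized forms for flagged verbs, not the layout of VEP's dictionary.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    standard: str                 # standardized spelling, archaic ending kept
    modern: Optional[str] = None  # Present Day English form, set only for flagged verbs


# Hypothetical entries for illustration.
RULES = {
    "betweene": Rule("between"),
    "hovv": Rule("how"),
    "loueth": Rule("loveth", modern="loves"),
    "louest": Rule("lovest", modern="love"),
}


def standardize(token: str, keep_archaic: bool = True) -> str:
    """Look up one token; flagged verbs honor the keep_archaic switch."""
    rule = RULES.get(token)
    if rule is None:
        return token
    if rule.modern is not None and not keep_archaic:
        return rule.modern
    return rule.standard


print(standardize("loueth"))
print(standardize("loueth", keep_archaic=False))
```

Unflagged entries such as "betweene" ignore the switch, so a user's choice about verb endings does not affect the rest of the dictionary.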
4.b RESEARCH METHODS
Research Assistant Deidre Stuffer researched spelling variation within the 61,000 TCP texts and generated the rules comprising VEP’s spelling standardization dictionary.
Deidre evaluated a list of the most common words, those that occur 2,000 or more times within a dataset of all TCP corpora, along with lists of high-priority targets to standardize. High-priority targets afforded better standardization across texts. The targets included words with the following character patterns:
- vv (hovv > how)
- uu (diuulged > divulged)
- vs (vsual > usual)
- vn (fortvne > fortune)
- vt (vtmost > utmost)
- iu (endiue > endive)
- ‘d (cloth’d > clothed)
Overall, Deidre examined roughly 49,000 unique spellings of words (26,700 being the most common words and 22,500 high priority targets) to create a dictionary of about 20,000 standardization rules. Not all of the spellings required standardization, as the most common words already show consistent spelling (e.g., “the” and “though”) and do not require further intervention.
The 26,700 most common words (those occurring 2,000 or more times) in the TCP account for 0.3% of all unique words (5,828,072) in the TCP. Yet standardization based on this 0.3% of unique words alone provides a coverage ratio of 95.04% of the TCP’s 1,476,894,257 total words. The additional high-priority targets increase this coverage ratio to 95.4%.
Simply put, for every hundred words on the unmodified page of each text, Deidre has made a judgment call on about 95 of them. Read more about how Deidre conducted her research in her blog entry, “Editing Programmatically.”
95% coverage is a reasonable compromise between performance and resources, because the curve of diminishing returns for standardization is quite steep. To illustrate, guaranteeing 98% coverage would require researching an additional 1,048,576 words. Furthermore, 58% of the unique words in the TCP occur only once, a staggering 3,386,964 words in total.
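The figures above can be related to one another with a little arithmetic. The sketch below uses only the numbers quoted in this section; the variable names are illustrative.

```python
TOTAL_WORDS = 1_476_894_257    # total words in the TCP
UNIQUE_WORDS = 5_828_072       # unique words in the TCP
SINGLETONS = 3_386_964         # unique words that occur once

COVERAGE_COMMON = 0.9504       # most common words only
COVERAGE_WITH_TARGETS = 0.954  # plus high-priority targets

# Share of unique words that occur once.
singleton_share = SINGLETONS / UNIQUE_WORDS

# Running words left outside the dictionary's coverage.
uncovered_tokens = round(TOTAL_WORDS * (1 - COVERAGE_WITH_TARGETS))

# Running words gained by adding the high-priority targets.
gained_tokens = round(TOTAL_WORDS * (COVERAGE_WITH_TARGETS - COVERAGE_COMMON))

print(f"{singleton_share:.1%}")
print(f"{uncovered_tokens:,}")
print(f"{gained_tokens:,}")
```

The singleton share comes out at roughly 58%, matching the figure in the text. The remaining uncovered words number in the tens of millions, yet they are spread across millions of rare spellings, which is why each further point of coverage costs so much research.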
1. The standardizer makes corrections at the following scales:
- 1:1: betweene > between
- 1:N: free-will > free will
- N:1: be gan > began
- N:N: arke of noe > ark of Noah
2. Programmatic standardization at the scale of the TCP therefore sacrifices correctness. Replacements are made to get it right most of the time rather than every time. For example, the original spelling “hede” can signify “head” and “heed”. Most instances of the meaning “heed” appear after the word “take”, with “take heed” being a common phrase. The dictionary standardizes the substring of original spellings “take hede” and “tak hede” to “take heed”, but it standardizes “hede” without “take” to “head.” Naturally, not all standardizations affecting “head” and “heed” will be the correct signification, but the majority of the examples will be correct.
3. Original spellings of nouns that end in -(e)s, which can be either a singular possessive form of the noun or a plural form of the noun, are always standardized to the plural. This strategy lumps together both the singular possessive and plural nouns into one standardized spelling (e.g., hostes > hosts). In instances where Deidre found an original spelling clearly tended to the singular possessive in the TCP texts, she made rules for the dictionary to reflect the singular possessive (e.g., christs > christ’s).
4. Programmatically expanding contracted pronoun phrases unfortunately sacrifices some of the nuances of the pronoun system. Contracted phrases that involve the second-person pronouns “you” or “ye” are expanded with “you” by default (y’are > you are). The algorithm does not use context and cannot distinguish between subject and object pronouns when expanding contracted pronoun phrases.
5. Certain frequent spellings have been forced to standardize to verbs and pronouns to get it right most of the time. These original spellings are “bee,” “doe,” and “wee”: (bee > be); (doe > do); (wee > we). These forced standardizations overwrite other words that share the spellings, such as the insect “bee” and the deer “doe.” Read about the rationale for these changes here, in a blog entry written for pipeline 1.0, when VEP used VARD for spelling standardization.
6. Deidre’s dictionary contains entries that attempt to fix separated word particles only for instances that were found during research (to morrow > tomorrow).
7. Because the dictionary is based on original spellings that occur 2,000 or more times in the TCP texts, verbs with archaic second- and third-person singular endings were flagged for standardization if they occurred 2,000 or more times.
8. While non-English languages are a regular feature in early printed English texts, our spelling replacements overwrite foreign words that share original spellings with English words. So the rule that changes “dyd” to “did” will also overwrite instances of the Welsh word ‘dyd’ meaning ‘day’.
9. Standardization is foremost meant to facilitate large-scale statistical analysis of the early printed record. The principles behind this corpus do not make it a good candidate for studying metricality.
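The replacement scales and the “take hede” example in the list above suggest a lookup in which multi-word rules take priority over single-word rules. The sketch below shows one way to get that behavior with a single regular expression; it is an assumption about how such a standardizer could work, not a description of VEP's implementation, and the rule set is limited to examples quoted in this document.

```python
import re

# Rules taken from the examples in the text (1:1, 1:N, N:1, and N:N).
RULES = {
    "take hede": "take heed",
    "tak hede": "take heed",
    "hede": "head",
    "betweene": "between",
    "free-will": "free will",
    "be gan": "began",
    "arke of noe": "ark of Noah",
}

# Longest originals first, so "take hede" is tried before bare "hede".
# The lookarounds stop a rule from matching inside a longer word.
_PATTERN = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(key) for key in sorted(RULES, key=len, reverse=True))
    + r")(?![\w-])"
)


def standardize(text: str) -> str:
    """Replace every original spelling with its standard form."""
    return _PATTERN.sub(lambda match: RULES[match.group(1)], text)


print(standardize("take hede of his hede betweene the arke of noe"))
```

Because the rules are applied without regard to meaning, an instance of “hede” that means “heed” but does not follow “take” would still become “head”, which is the trade-off described in item 2.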