To make the work of the Text Creation Partnership (TCP) available for large-scale computational analysis, VEP designed a text-processing pipeline to support spelling standardization and flexible text extraction. For ease of use, VEP generates ASCII representations of TCP source files. The pipeline consequently reduces character variety so that corpora, in a lowest common denominator format, can be analyzed by tools other than our own.

Our pipeline includes three important steps in the following order: character cleaning, text extraction, and spelling standardization. We designed character cleaning and text extraction to facilitate spelling standardization based on the composition of our source files.

Character Cleaning

The TCP provides richly annotated XML/SGML files that represent the contents of digitized microfilm images of early English books. To encode the variety encountered in the digital images, the TCP transcriptions rely on a mixture of TEI markup and UTF-8 symbols. Character cleaning is an important first step to support text extraction and spelling standardization, as it reduces the variety of UTF-8 symbols and provides ASCII equivalents where possible. Our SimpleText format sacrifices the richness of TCP’s transcriptions to support algorithmic processing of texts’ language.

While we remove most traces of TCP editorial intervention found in the original source files, we maintain textual features that impact the representation of words in files, such as illegible characters and gaps of missing text. We modify source file content to support spelling standardization and avoid conflicts with XML reserved characters. The script executes in two phases. The first standardizes UTF-8 characters to ASCII equivalents and assigned substitutions. It additionally removes specified XML tags and replaces them with assigned ASCII characters. The second phase strips the text of the pipes that the TCP uses for end-of-line hyphens and other specified output like unrecognized characters.
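
The two phases described above can be sketched roughly as follows. This is a minimal illustration, not VEP's actual script: the substitution table, the GAP pattern, the "^" placeholder, and the function names are all assumptions made for the example, since the real mappings are not listed here.

```python
import re

# Hypothetical character substitutions (the real VEP table is not shown in this text).
CHAR_MAP = {
    "\u017f": "s",   # long s
    "\u00e6": "ae",  # ash ligature
    "\u2014": "--",  # em dash
    "\u2019": "'",   # right single quotation mark
}

# Hypothetical tag rule: replace a gap tag with an assumed ASCII placeholder.
TAG_MAP = {
    r"<GAP[^>]*>": "^",
}

def phase_one(text):
    """Map UTF-8 characters to ASCII and swap selected tags for ASCII markers."""
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    for pattern, replacement in TAG_MAP.items():
        text = re.sub(pattern, replacement, text)
    return text

def phase_two(text):
    """Strip the pipes that mark end-of-line hyphens."""
    return text.replace("|", "")

def clean(text):
    return phase_two(phase_one(text))
```

Running the phases in this order means the second pass only ever sees ASCII, which keeps the pipe-stripping step simple.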

Text Extraction

VEP’s text extraction script takes advantage of the flexibility that TEI markup affords. The script utilizes a configuration file that directs whether to extract or ignore text between certain XML tags, and how to format text when generating plain text file representations of TCP XML. As a result, users can extract the specific text that they want from the XML source files.

VEP deploys two text extraction configurations: default and dramatic (based on the needs of our collaborators). The default configuration extracts all text from XML documents, while the dramatic configuration extracts only the text that is meant to be spoken.

The default configuration file was made by surveying all XML elements and their attributes within the TEI P4 versions of EEBO-TCP Phase I, EEBO-TCP Phase II, ECCO-TCP, and Evans-TCP. The result is an alphabetical list of XML elements and their attributes, to which we added directions on what to print and how. The dramatic configuration file is a more restricted version of the default.
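
To illustrate how a configuration can direct extraction, here is a small sketch using Python's standard xml.etree.ElementTree. The configuration format (a dictionary mapping tag names to "extract" or "ignore"), the choice of ignored tags, and the sample XML are assumptions for the example; VEP's actual configuration files are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configurations: tag name -> "extract" or "ignore".
DEFAULT_CONFIG = {}  # nothing ignored, so all text is extracted
DRAMATIC_CONFIG = {"SPEAKER": "ignore", "STAGE": "ignore", "HEAD": "ignore"}

def extract(elem, config):
    """Collect text under elem, skipping any element the config marks as ignored."""
    if config.get(elem.tag, "extract") == "ignore":
        return []
    parts = []
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.extend(extract(child, config))
        # Tail text belongs to the parent, so keep it even when the child is ignored.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts

sample = (
    "<SP><SPEAKER>Ham.</SPEAKER>"
    "<L>To be, or not to be,</L>"
    "<STAGE>Exit.</STAGE></SP>"
)
root = ET.fromstring(sample)
default_text = " ".join(extract(root, DEFAULT_CONFIG))
dramatic_text = " ".join(extract(root, DRAMATIC_CONFIG))
```

With the default configuration, the speaker label and stage direction are kept. With the dramatic one, only the spoken line remains.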

The text files’ format reflects the structure of their XML source files rather than the printed text in the digitized microfilm images that served as the basis for TCP transcriptions. The script prints text in lines no longer than 80 characters.

Content within paragraph tags is followed by two new lines, leaving a blank line before the following content.
Content within head and line tags is followed by one new line, leaving no space between it and the following content.

Speech is represented by lines, with space separating the speech of one character from another.
The script also fixes punctuation for spelling standardization. It adds spaces before and after the two hyphens (--) that represent a dash, so that the words on either side can be recognized and evaluated.
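
The formatting rules above can be approximated with the standard library. This is a sketch under assumptions: the function names and the "paragraph" label are hypothetical, and the handling of speech spacing is left out.

```python
import re
import textwrap

def space_dashes(text):
    """Put one space on each side of a double hyphen so both words stand alone."""
    return re.sub(r"\s*--\s*", " -- ", text)

def format_block(text, kind):
    """Wrap text to 80 columns and append the terminator for its tag type.

    kind is "paragraph" (two new lines) or anything else, such as a head
    or line, which gets one new line.
    """
    wrapped = textwrap.fill(space_dashes(text), width=80)
    terminator = "\n\n" if kind == "paragraph" else "\n"
    return wrapped + terminator
```

For example, space_dashes("heaven--and earth") yields "heaven -- and earth", so a tokenizer splitting on whitespace sees "heaven" and "and" as separate words.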

Spelling Standardization

The plain text files were processed through VARD 2.5.4. The version of VARD we deploy has been trained in the following manner specifically for the Early Modern Drama corpus.

The goal of spelling standardization is to facilitate computational analysis of early modern texts. To that end, we have chosen to expand pronoun contractions, elided adverbs, and elided prepositions. By default, VARD is programmed to change verbs with singular second- and third-person endings (-(e)st, -(e)th) to their modern equivalents.
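
As a toy illustration of the kinds of expansion described above, the sketch below uses a plain lookup table. It does not reproduce VARD's method, dictionaries, or rules. VARD is a separate tool, and the three entries and the function name here are assumptions chosen only to show one example of each category.

```python
import re

# Hypothetical examples only; not taken from VARD's dictionaries or rules.
EXPANSIONS = {
    "'tis": "it is",   # pronoun contraction
    "ne'er": "never",  # elided adverb
    "o'er": "over",    # elided preposition
}

def expand(text):
    """Replace known contracted or elided forms; leave every other word as-is."""
    def replace_word(match):
        word = match.group(0)
        # Lookup is case-insensitive; the replacement is always lowercase.
        return EXPANSIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", replace_word, text)
```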

See the following blog posts for more information:
Aggressive Rules
Correction Normalization Errors
Forcing Standardization I
Forcing Standardization II