Warning: this page is out of date
Please see the page for the more modern pipeline.
VEP Scripts
div-divider:
- processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
- allows extraction of specific DIV types (e.g., “play”)
- naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
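As a rough illustration of the naming rule, the sketch below walks a TEI tree and builds one output name per DIV. It is not the div-divider source. It assumes that DIV elements are tagged DIV, DIV1, DIV2, and so on, that the type is held in a TYPE attribute, that the global number counts every DIV in document order, that the type number counts DIVs of the same type, and that the level is the nesting depth. The function name name_divs and its parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: DIV elements are tagged DIV, DIV1, DIV2, ... as in TCP TEI-P4 files.
DIV_TAG = re.compile(r"^DIV\d*$", re.IGNORECASE)

def name_divs(xml_text, name, wanted_type=None):
    """Return (output_name, element) pairs for DIVs, in document order.

    Names follow [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level].
    If wanted_type is given, only DIVs of that type are returned, but the
    counters still advance for every DIV so the numbering stays stable.
    """
    root = ET.fromstring(xml_text)
    type_counts = Counter()
    global_no = 0
    results = []

    def walk(elem, level):
        nonlocal global_no
        if DIV_TAG.match(elem.tag):
            level += 1                      # nesting depth of this DIV
            global_no += 1                  # running count over all DIVs
            div_type = elem.get("TYPE", "unknown")
            type_counts[div_type] += 1      # running count within this type
            if wanted_type is None or div_type == wanted_type:
                out_name = f"{name}_{global_no}_{div_type}_{type_counts[div_type]}_{level}"
                results.append((out_name, elem))
        for child in elem:
            walk(child, level)

    walk(root, 0)
    return results
```

Calling name_divs(xml_text, "A00001", wanted_type="play") would return only the DIVs typed "play"; each returned element could then be written to its own file with ET.tostring.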
div-merger:
- sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a <COLLECTION> tag
- user must specify name for output file
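The merge step can be pictured with the minimal sketch below. It assumes that "sequentially" means sorted filename order and that every input file ends in .xml and is well-formed; neither detail is confirmed by the description above, and merge_folder is a hypothetical name rather than the script's real interface.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def merge_folder(input_dir, output_file):
    """Wrap the root element of every XML file in input_dir in one <COLLECTION>.

    Files are read in sorted filename order (an assumption). Returns the
    number of elements merged.
    """
    collection = ET.Element("COLLECTION")
    for path in sorted(Path(input_dir).glob("*.xml")):
        collection.append(ET.parse(str(path)).getroot())
    ET.ElementTree(collection).write(
        str(output_file), encoding="utf-8", xml_declaration=True
    )
    return len(collection)
```

The output file name is passed in explicitly, mirroring the requirement that the user specify it. Writing the output outside the input folder avoids merging it back in on a second run.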
pre-VARDer:
- prepares TCP XML files for VARD
- replaces XML reserved characters (<, >, %) with at signs (@)
- replaces ampersands (&) with the word “and”
- removes XML comments and TEI XML tags that can interrupt words: <SEG>, <SUB>, <SUP>
- transforms non-ASCII characters into ASCII alternatives (e.g., “naïve” to “naive”)
- replaces dashes (—) with two hyphens (--)
- replaces TCP illegible characters (bullet: •) with carets (^)
- replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
- replaces non-ASCII characters not assigned ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
- replaces TCP missing word symbol (lozenge in brackets: ◊) with ellipses in parentheses ((…))
- removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
- if desired, removes textual symbols converted to at signs (@)
- VEP Unicode Character Substitutions
- TCP Unicode Character Survey
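The character-level rules above can be sketched as a single pass over a string. This is only an illustration under several assumptions: it works on text content rather than on markup, it uses Unicode NFKD decomposition as a stand-in for the VEP substitution table, and it writes the missing-word marker as three ASCII periods in parentheses. The names SUBSTITUTIONS and to_ascii are hypothetical.

```python
import unicodedata

# Explicit rules from the list above; checked before the generic ASCII fallback.
SUBSTITUTIONS = {
    "&": "and",
    "<": "@", ">": "@", "%": "@",
    "\u2014": "--",      # em dash -> two hyphens
    "\u2022": "^",       # bullet: TCP illegible character
    "\u25aa": "*",       # small black square: TCP unrecognizable punctuation
    "\u25ca": "(...)",   # lozenge: TCP missing word (ASCII periods assumed)
    "|": "",             # vertical bar: TCP end-of-line hyphen
    "\u00a6": "",        # broken vertical bar: TCP end-of-line hyphen
}

def to_ascii(text, drop_at_signs=False):
    """Apply the substitutions, then reduce what remains to ASCII."""
    out = []
    for ch in text:
        if ch in SUBSTITUTIONS:
            out.append(SUBSTITUTIONS[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Decompose accented letters and keep only their ASCII base.
            base = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
            # Characters with no ASCII equivalent (e.g., pilcrow) become "@".
            out.append(base if base else "@")
    result = "".join(out)
    # Optional clean-up; note this also removes "@" signs present in the source.
    return result.replace("@", "") if drop_at_signs else result
```

With these rules, to_ascii("na\u00efve") yields "naive" and a pilcrow yields "@". A hand-built table is more predictable than NFKD for characters such as fractions and ligatures, which is presumably why the pipeline documents its own substitutions.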
tei-decoder:
- flexibly eliminates XML tags, their attributes, and their content to produce plain text
- uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
- prints text in lines no longer than 80 characters
- TCP TEI-P4 XML Tag Survey
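To make the config-driven behavior concrete, here is a minimal sketch. It assumes a configuration that maps a tag name to either "keep" (print its text) or "drop" (discard the tag and its content), with unlisted tags kept. The real tei-decoder config format is not described above, so CONFIG, decode, and the two action names are hypothetical.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical config: tags mapped to "drop" lose their content entirely.
CONFIG = {"NOTE": "drop", "FIGURE": "drop"}

def decode(xml_text, config=CONFIG, width=80):
    """Strip tags according to config and wrap output lines to `width` characters."""
    root = ET.fromstring(xml_text)

    def collect(elem):
        if config.get(elem.tag, "keep") == "drop":
            return ""
        parts = [elem.text or ""]
        for child in elem:
            parts.append(collect(child))
            # Text following a child belongs to the parent, so it is kept
            # even when the child itself is dropped.
            parts.append(child.tail or "")
        return "".join(parts)

    lines = []
    for line in collect(root).splitlines():
        # Wrap each source line separately so existing line breaks survive.
        lines.extend(textwrap.wrap(line, width=width) or [""])
    return "\n".join(lines)
```

Wrapping line by line, rather than reflowing the whole text, keeps the layout of the source file where it can.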
VARD
- software that standardizes Early Modern English spelling across corpora
- version 2.5.4
- trained by Deidre Stuffer on Jonathan Hope’s early modern drama corpora
- expands contracted pronouns, elided adverbs, and elided prepositions
- standardizes Early Modern English second- and third-person singular verb endings to modern equivalents
- Aggressive Rules
- Correcting Normalization Errors
- Forcing Standardization I
- Forcing Standardization II
Known Issues
1. Hyphenation discrepancy: our analytical tools miscount some words because the pipeline preserves end-of-line hyphens in SimpleText files
Explanation: Our plain text generation tools, to the extent they can, preserve the layout of source XML files when generating SimpleText plain text representations. As a result, SimpleText files preserve hyphens that occur at the end of lines in texts. For example, take this intentional, performative hyphen use in poetry from TCP file K012309.000:
Conjunctions, Prepositions, Interjec-
Tions, in blameful negligence. -- Ah!
At the moment, our internal-use analytical scripts that generate n-grams treat an end-of-line hyphen as part of the word, joining it to the syllable(s) that follow on the next line. The algorithm therefore counts “Interjec-Tions” as a word distinct from “interjections”. Future iterations of our tools will aim to fix this discrepancy so that end-of-line hyphens are not treated as valid characters within a word.
The tool Ubiqu+Ity removes end-of-line hyphens in its analysis stage, treating “interjec-tions” as “interjections”.
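One way to remove the discrepancy is to join words split across lines before tokenizing. The sketch below is an assumption about how such a fix might look; it is not the code used by Ubiqu+Ity or by our n-gram scripts, and the names join_eol_hyphens and tokens are hypothetical.

```python
import re

# A hyphen directly after a word character, followed by a line break and
# another word character, is treated as an end-of-line hyphen.
EOL_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")

def join_eol_hyphens(text):
    """Rejoin words that were split across two lines by a hyphen."""
    return EOL_HYPHEN.sub(r"\1\2", text)

def tokens(text):
    """Lowercase word tokens; hyphens inside a line are kept as part of the word."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", join_eol_hyphens(text).lower())

sample = "Conjunctions, Prepositions, Interjec-\nTions, in blameful negligence. -- Ah!"
print(tokens(sample))
```

On the sample lines from K012309.000 the token list contains "interjections" as a single word. A trade-off of this approach is that a genuine compound that happens to break at its hyphen, such as "well-" followed by "known" on the next line, also loses its hyphen.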