Also see the GitHub repo for code, data, and explanations.
There are three scripts that handle character cleaning, text extraction, and spelling standardization.
characterCleaner prepares TCP XML files to facilitate spelling standardization later in the pipeline.
- replaces reserved XML characters (<, >, %) with at-signs (@)
- replaces ampersands (&) with the word “and”
- removes XML comments and TEI XML tags that can interrupt words: <SEG>, <SUB>, <SUP>
- transforms non-ASCII letters into ASCII alternatives (e.g., “naïve” to “naive”)
- makes the following character replacements:
  - a dash (—) becomes two hyphens (--)
  - TCP illegible characters (bullet: •) become carets (^)
  - TCP unrecognizable punctuation (small black square: ▪) becomes asterisks (*)
  - non-ASCII characters not assigned ASCII equivalents (e.g., pilcrow: ¶) become at signs (@)
  - the TCP missing word symbol (lozenge in brackets: ◊) becomes an ellipsis in parentheses ((…))
- deletes TCP end-of-line hyphen characters supplied during transcription (vertical bar: |, broken vertical bar: ¦)
- if desired, removes textual symbols converted to at signs (@)
- VEP Unicode Character Substitutions
- TCP Unicode Character Survey
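The character-level replacements listed above can be sketched roughly as follows. This is a minimal illustration, not the characterCleaner script itself: the names `REPLACEMENTS` and `clean_text` are hypothetical, the XML-specific steps (reserved characters, ampersands, comments, tag removal) and the missing-word symbol are left out, and Unicode decomposition is assumed as the way to find ASCII alternatives for accented letters.

```python
import unicodedata

# Hypothetical mapping that mirrors the replacements described above.
REPLACEMENTS = {
    "\u2014": "--",  # dash -> two hyphens
    "\u2022": "^",   # bullet (TCP illegible character) -> caret
    "\u25aa": "*",   # small black square (unrecognizable punctuation) -> asterisk
    "|": "",         # vertical bar (end-of-line hyphen marker) is deleted
    "\u00a6": "",    # broken vertical bar is deleted
}


def clean_text(text: str) -> str:
    """Apply the replacements, then reduce remaining characters to ASCII."""
    out = []
    for ch in text:
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
            continue
        # Decompose the character and drop combining marks (accents).
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        if stripped == "":
            continue  # a bare combining mark: nothing left to keep
        # Anything still outside ASCII has no assigned equivalent: use an at sign.
        out.append(stripped if stripped.isascii() else "@")
    return "".join(out)


print(clean_text("na\u00efve \u2014 w\u2022rd \u00b6"))
```

A real cleaner would likely use an explicit, researched substitution table rather than relying on decomposition alone, since decomposition does not cover every letter with an obvious ASCII counterpart.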
tei-decoder flexibly eliminates XML tags, their attributes, and their content to produce plain text files from XML.
- uses a config file that determines what text to print to a new file
- TCP TEI-P4 XML Tag Survey
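As a rough sketch of config-driven text extraction, the example below drops selected elements together with their content and keeps the text of everything else. It is an assumption-laden illustration rather than the tei-decoder implementation: the `DROP_TAGS` set stands in for the config file, whose real format is not described here, and the tag names in the sample are only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the config file: elements whose content is omitted.
DROP_TAGS = {"NOTE", "FIGURE"}


def extract_text(elem, drop_tags=DROP_TAGS):
    """Return the text inside elem, skipping dropped elements and their content."""
    if elem.tag in drop_tags:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(extract_text(child, drop_tags))
        # Tail text sits after the child but belongs to the parent,
        # so it is kept even when the child itself is dropped.
        parts.append(child.tail or "")
    return "".join(parts)


sample = "<TEXT><P>Of <HI>Interjections</HI>.<NOTE>marginal note</NOTE> The end.</P></TEXT>"
print(extract_text(ET.fromstring(sample)))  # prints "Of Interjections. The end."
```

Keeping the tail text separate from the element is the main design point: it lets a tag and its content be removed without losing the words that follow it.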
EMstandardizer standardizes Early Modern English spelling variation across the TCP corpora (EEBO-TCP, ECCO-TCP, and Evans-TCP inclusive) to facilitate scalable, computationally driven analysis.
- uses a dictionary file (standardizer_dictionary.txt), researched and compiled by Deidre Stuffer, that provides a standard spelling for words in the TCP texts with recognizable spelling variation
- expands contracted pronouns, elided adverbs, and elided prepositions
- provides an option to standardize Early Modern English second- and third-person singular verb endings to modern equivalents or archaic endings
- can generate annotation files that display which words have been standardized and their original spellings
- read more about spelling standardization on the workflow page
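Dictionary-based standardization with an annotation record might look something like the sketch below. The entries in `SAMPLE_DICTIONARY`, the `TOKEN` pattern, and the `standardize` function are all hypothetical; the actual format and contents of standardizer_dictionary.txt are not shown here, and this sketch does not attempt the verb-ending option.

```python
import re

# Hypothetical entries for illustration only.
SAMPLE_DICTIONARY = {"loue": "love", "vertue": "virtue", "'tis": "it is"}

# A word, optionally starting with an apostrophe and containing internal apostrophes.
TOKEN = re.compile(r"'?[A-Za-z]+(?:'[A-Za-z]+)*")


def standardize(text, dictionary=SAMPLE_DICTIONARY):
    """Return the standardized text plus a list of (original, standard) pairs."""
    changes = []

    def replace(match):
        word = match.group(0)
        standard = dictionary.get(word.lower())
        if standard is None:
            return word  # no recognized variant: leave the word untouched
        changes.append((word, standard))
        return standard  # note: original capitalization is not preserved here

    return TOKEN.sub(replace, text), changes


new_text, annotations = standardize("'tis vertue to loue")
print(new_text)
print(annotations)
```

The list of pairs plays the role of an annotation file: it records each word that was changed alongside its original spelling.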
Known Issues with the Pipeline
1. End-Of-Line Hyphens: At the moment, our internal-use analytical scripts do not recognize a word that straddles two lines (a word with an end-of-line hyphen, like “Interjec-/Tions”). The current algorithm therefore splits such a word in two: it reads “Interjec-/Tions” as the two words “interjec” and “tions”. Future iterations of our software will aim to fix this discrepancy so that a word broken by an end-of-line hyphen is read as a single word. More research is needed to determine how common end-of-line hyphens are in SimpleText plain text files, and whether hyphens at the ends of lines in our source texts are meant to be hyphens or dashes. The TCP XML source files represent physical books published between 1470 and 1800, a period when printing practices varied.
The one exception is Ubiqu+Ity, our text-processing software, which removes end-of-line hyphens in its analysis stage, treating “interjec-tions” as “interjections”.
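One simple way to rejoin words split by end-of-line hyphens is sketched below. It is not the Ubiqu+Ity implementation or a planned fix; the `rejoin_hyphenated` function is hypothetical, and it assumes every hyphen directly before a line break marks a split word, which is exactly the open question raised above (a genuine hyphen or dash at the end of a line would be removed by mistake).

```python
import re

# A letter, a hyphen, a line break with optional indentation, then another letter.
EOL_HYPHEN = re.compile(r"([A-Za-z])-\r?\n[ \t]*([A-Za-z])")


def rejoin_hyphenated(text: str) -> str:
    """Join the two halves of a word split by a hyphen at a line break."""
    return EOL_HYPHEN.sub(r"\1\2", text)


page = "Of Interjec-\nTions and other parts"
print(rejoin_hyphenated(page).lower().split())
```

Without the rejoining step, splitting the same text on the hyphen and whitespace would yield "interjec" and "tions" as separate words.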