Pipeline (OLD)

 

Warning: this page is out of date.

Please see the page for the more modern pipeline.

VEP Scripts

div-divider:

  • processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
  • allows extraction of specific DIV types (e.g., “play”)
  • naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
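The naming rule above can be illustrated with a minimal Python sketch. It is not the div-divider source. It assumes that DIV elements are tagged DIV, DIV1, DIV2, and so on, that the type comes from a TYPE attribute, that [level] is the digit in the tag name, and that both counters run in document order. The function name name_divs and its parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: DIV elements are tagged DIV, DIV1, DIV2, ...
DIV_TAG = re.compile(r"^DIV(\d*)$", re.IGNORECASE)

def name_divs(xml_text, name, wanted_types=None):
    """Return (output_name, element) pairs, one per selected DIV element."""
    root = ET.fromstring(xml_text)
    per_type = Counter()   # running count for each DIV type
    global_no = 0          # running count across all DIVs
    results = []
    for elem in root.iter():               # iter() walks in document order
        match = DIV_TAG.match(elem.tag)
        if match is None:
            continue
        global_no += 1
        # Assumption: the DIV type is stored in a TYPE attribute.
        div_type = (elem.get("TYPE") or "unknown").replace(" ", "-")
        per_type[div_type] += 1
        level = match.group(1) or "0"      # assumption: level = digit in the tag
        if wanted_types is None or div_type in wanted_types:
            out = f"{name}_{global_no}_{div_type}_{per_type[div_type]}_{level}"
            results.append((out, elem))
    return results
```

In this sketch, passing wanted_types={"play"} keeps only DIVs of type "play" while the counters still advance over every DIV, so the numbers in a filename stay stable whichever types are extracted. Whether the real script numbers files this way is an assumption.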

div-merger:

  • sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a <COLLECTION> tag
  • user must specify name for output file
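As a rough illustration of the merge step, the sketch below wraps the root element of every XML file in a folder inside a single <COLLECTION> element. It is only a sketch under stated assumptions: the function name merge_xml_files is hypothetical, "sequentially" is taken to mean sorted filename order, and input files are assumed to end in .xml.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def merge_xml_files(input_dir, output_path):
    """Wrap the root of every .xml file in input_dir in one <COLLECTION>.

    Returns the number of elements merged.
    """
    collection = ET.Element("COLLECTION")
    # Assumption: "sequentially" means sorted filename order.
    for path in sorted(Path(input_dir).glob("*.xml")):
        collection.append(ET.parse(str(path)).getroot())
    tree = ET.ElementTree(collection)
    tree.write(str(output_path), encoding="utf-8", xml_declaration=True)
    return len(collection)
```

Writing the output file outside the input folder avoids merging a previous result into itself on a second run.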

pre-VARDer:

  • prepares TCP XML files for VARD
  • replaces XML reserved characters (<, >, %) with at signs (@)
  • replaces ampersands (&) with the word “and”
  • removes XML comments and TEI XML tags that can interrupt words: <SEG>, <SUB>, <SUP>
  • transforms non-ASCII characters into ASCII alternatives (e.g., “naïve” to “naive”)
  • replaces dashes (—) with two hyphens (--)
  • replaces TCP illegible characters (bullet: •) with carets (^)
  • replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
  • replaces non-ASCII characters not assigned ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
  • replaces TCP missing word symbol (lozenge in brackets: ◊) with ellipses in parentheses ((…))
  • removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
  • if desired, removes textual symbols converted to at signs (@)
  • VEP Unicode Character Substitutions
  • TCP Unicode Character Survey
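The character-level rules above can be sketched in Python as follows. This is an illustration, not the pre-VARDer code: it works on a plain string of text content and skips the removal of XML comments and tags. The names SUBSTITUTIONS and to_ascii are hypothetical, the diacritic stripping relies on Unicode NFKD decomposition rather than the VEP substitution table, and the bare lozenge is mapped to "(...)" without handling its surrounding brackets.

```python
import unicodedata

# Symbol-to-ASCII rules taken from the list above.
SUBSTITUTIONS = {
    "&": "and",
    "<": "@", ">": "@", "%": "@",   # characters the page lists as XML reserved
    "\u2014": "--",                 # em dash
    "\u2022": "^",                  # bullet: TCP illegible character
    "\u25AA": "*",                  # small black square: unrecognizable punctuation
    "\u25CA": "(...)",              # lozenge: TCP missing word symbol
    "|": "", "\u00A6": "",          # TCP end-of-line hyphen characters
}

def to_ascii(text, drop_at_signs=False):
    """Apply the substitutions, then reduce remaining non-ASCII characters."""
    out = []
    for ch in text:
        if ch in SUBSTITUTIONS:
            out.append(SUBSTITUTIONS[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Assumption: NFKD decomposition stands in for the VEP table.
            base = unicodedata.normalize("NFKD", ch)
            base = base.encode("ascii", "ignore").decode("ascii")
            out.append(base if base else "@")   # no ASCII equivalent -> "@"
    result = "".join(out)
    # Simplification: this drops every "@", including any present in the input.
    return result.replace("@", "") if drop_at_signs else result
```

For example, to_ascii("naïve") gives "naive", and a pilcrow becomes "@" unless drop_at_signs is set.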

tei-decoder:

  • flexibly eliminates XML tags, their attributes, and their content to produce plain text
  • uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
  • prints text in lines no longer than 80 characters
  • TCP TEI-P4 XML Tag Survey
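A config-driven tag stripper of this kind might look like the sketch below. The real tei-decoder config format is not described here, so the CONFIG dictionary, its "drop" action, and the function names are all hypothetical. The sketch also collapses whitespace before wrapping, a simplification that does not reproduce the source layout.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical config: tag name -> "drop" (omit the tag and its content).
# Tags not listed are removed but their text is kept.
CONFIG = {"NOTE": "drop", "FIGURE": "drop"}

def collect_text(elem, config, parts):
    """Append the text to keep from elem and its descendants to parts."""
    if config.get(elem.tag) == "drop":
        return
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        collect_text(child, config, parts)
        if child.tail:   # text after a child belongs to the parent, so keep it
            parts.append(child.tail)

def decode(xml_text, config=CONFIG, width=80):
    """Return plain text wrapped to lines of at most width characters."""
    parts = []
    collect_text(ET.fromstring(xml_text), config, parts)
    text = " ".join("".join(parts).split())   # collapse runs of whitespace
    return textwrap.fill(text, width=width)
```

Keeping the rules in a dictionary rather than in the traversal code means a different set of tags can be dropped without changing the function.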

VARD

Known Issues

1. Hyphenation discrepancy: our analytical tools miscount some words because the pipeline preserves end-of-line hyphens in SimpleText files

Explanation: Our plain text generation tools, to the extent they can, preserve the layout of source XML files when generating SimpleText plain text representations. As a result, SimpleText files preserve hyphens that occur at the end of lines in texts. For example, take this intentional, performative hyphen use in poetry from TCP file K012309.000:

Conjunctions, Prepositions, Interjec-
Tions, in blameful negligence. -- Ah!

At the moment, our internal-use analytical scripts that generate n-grams treat a hyphen at the end of a line as part of the word and join it to the syllable(s) on the next line. The algorithm therefore counts “Interjec-Tions” as a word distinct from “interjections”. Future iterations of our tools will aim to fix this discrepancy so that end-of-line hyphens are not treated as valid characters within a word.

The tool Ubiqu+Ity removes end-of-line hyphens in its analysis stage, treating “interjec-tions” as “interjections”.
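One possible way to remove end-of-line hyphens before counting words is sketched below in Python. It is neither the Ubiqu+Ity implementation nor the planned fix, and the names EOL_HYPHEN, join_eol_hyphens, and tokens are hypothetical. It joins any word that is split by a hyphen at a line break, so a genuine compound that happens to break at the hyphen would also be merged.

```python
import re

# A hyphen at the end of a line, followed by the continuation on the next line.
EOL_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")

def join_eol_hyphens(text):
    """Join words split across lines, dropping the hyphen and the line break."""
    return EOL_HYPHEN.sub(r"\1\2", text)

def tokens(text):
    """Lowercase word tokens; hyphens inside a line stay part of the word."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", join_eol_hyphens(text).lower())
```

Applied to the two lines quoted above, tokens yields “interjections” as a single token, and the spaced “--” is not counted as a word.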