VEP’s Metadata Builder

VEP’s Metadata Builder helps users navigate our vast corpora collections. It delivers metadata in an accessible, intelligible format, sparing users from deciphering spreadsheets with more than 120 columns and 160,000 rows. This blog post explains the motivations for creating the Metadata Builder and walks through its main components so that users understand how to use the tool.

Motivations

This tool was built to allow users to generate and download metadata spreadsheets of VEP corpora tailored to their own research interests. Users can merge metadata from multiple spreadsheets into one, filter their results, and engage with the texts. The Metadata Builder also provides options for downloading the metadata and README documentation generated for these tailor-made spreadsheets.

The Metadata Builder’s functionality is guided by data management best practices: because datasets are generated on demand from our source spreadsheets, the builder ensures that users always have access to the most up-to-date information.

Main Components

The Metadata Builder has five sequential steps:

  1. Pick the Documents
  2. Pick Metadata Fields
  3. Examine the Metadata
  4. Save the Metadata
  5. Save the README

1. PICK THE DOCUMENTS
The first step requires users to select the spreadsheet based on the VEP corpus they are interested in working with. The corpus spreadsheets contain information about corpus texts and files. Hovering over a corpus in the dropdown menu activates a tooltip displaying a brief corpus description. This description is replicated in the Step 1 box once a corpus is selected.

[Screenshot: Metadata Builder, Step 1]

For each corpus, VEP offers two versions of metadata spreadsheets: ‘Unrestricted Only’ and ‘All’. The difference comes down to licensing: agreements prevent VEP from releasing corpus text files made from restricted content in our source files, though we can share metadata about restricted files for research purposes. VEP therefore gives users the option to download a spreadsheet of only the free files in a corpus (Unrestricted Only) or a spreadsheet of both free and restricted files (All). Unrestricted Only spreadsheets catalog the files available in our corpus downloads.

2. PICK THE METADATA FIELDS
This step of the Metadata Builder can be overwhelming for users unfamiliar with the data we provide. It is also the fun part, though: this is where users select the specific information they are interested in.

To understand this step, users need to know that the Metadata Builder pulls information from multiple existing spreadsheets into a single spreadsheet specified by the user. From these spreadsheets, users specify the exact columns they want dynamically generated as their dataset. Metadata spreadsheets are listed in bold print. Beside each metadata spreadsheet name is a question mark; hovering over it displays a brief description of the spreadsheet’s contents. Users can also view a spreadsheet by clicking the source link to the right of the question mark, which opens the spreadsheet in a new window. Dropdown menus to the right of the metadata spreadsheet list contain the names of all available columns from each linked spreadsheet, and hovering over the column names displays explanatory tooltips.

[Screenshot: Metadata Builder, Step 2]

Below is a list that explains the metadata type spreadsheets you may see in step two:

  • Master Metadata: This spreadsheet contains the corpus metadata provided by the curator, from text name and author to genre and number of pages in the text.
  • Text Links: This spreadsheet contains links to files of unrestricted corpus content hosted on the VEP server. You can click on the links to read corpus texts.
  • Ubiq Categories: This spreadsheet contains DocuScope LAT information for the files in the selected corpus.
  • TCP Metadata: This spreadsheet contains metadata for all TCP digital texts, provided by the TCP.
  • Non-English Language Metadata: This spreadsheet lists primary and secondary languages for texts in the TCP that have the least amount of recognizable English in them.
  • Figures-Per-Text Metadata: This spreadsheet lists the number of FIGURE XML tags that appear in each TCP XML text. It is useful for finding texts that contain tables and images.
  • Derived Date Metadata: This spreadsheet provides a programmatically selected date for every text in the TCP.

Users can select columns from as few as one metadata type spreadsheet or from as many as all of them. Once users select metadata columns, a ‘Build Metadata’ button appears in the section. Pressing the button generates the data table in the step three section.
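
For readers who prefer to work outside the browser, the operation behind this step is essentially a column-wise join of several spreadsheets on a shared document identifier. Below is a minimal pandas sketch of that join; the file names, column names, and the "filename" key are hypothetical stand-ins, not the Metadata Builder’s actual internals.

    import pandas as pd

    # Hypothetical metadata spreadsheets saved locally as CSVs.
    master = pd.read_csv("master_metadata.csv")        # curator-provided metadata
    tcp = pd.read_csv("tcp_metadata.csv")              # TCP catalog metadata
    dates = pd.read_csv("derived_date_metadata.csv")   # programmatically derived dates

    # Keep only the columns of interest, then merge on the shared identifier.
    custom = (
        master[["filename", "author", "title", "genre"]]
        .merge(tcp[["filename", "tcp_id"]], on="filename", how="left")
        .merge(dates[["filename", "derived_date"]], on="filename", how="left")
    )

    custom.to_csv("custom_metadata.csv", index=False)  # the spreadsheet you would save in step 4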

3. EXAMINE AND CONFIRM THE METADATA
This section renders a dataset based on the specifications entered in section two. It allows users to sort and search the metadata.

[Screenshot: Metadata Builder, Step 3]

4. SAVE THE METADATA
This section allows users to save the generated datasets in a variety of formats. Users can save their datasets as Excel spreadsheets or CSVs. Users can optionally copy all of the information to their clipboard or print it.

[Screenshot: Metadata Builder, Step 4]

5. SAVE THE README
This section contains a red button labeled ‘Download README.’ This step is important: the Metadata Builder dynamically generates a README file that explains the contents of your dataset.

[Screenshot: Metadata Builder, Step 5]

Stay tuned for an upcoming blog post by Heather that walks users through what she finds to be the Metadata Builder’s most useful features!

VEP Releases

VEP has been busy improving its visualization tools and processing pipeline! You can read about all the changes in the list below.

Release Information

  • Text Processing Pipeline 2.0 features better Unicode handling during character cleaning and a dictionary that standardizes spelling variation across TCP corpora. Read about the pipeline on the ‘Workflow’ page. Download the pipeline from GitHub.
  • TextDNA is available for download! The download includes sample datasets and Python scripts for curating your own TextDNA datasets. Download it from GitHub.
  • Ubiqu+Ity 1.2 is officially released! The SlimTV (or Slim TextViewer) replaces Ubiqu+Ity HTML files for navigating tagged text.
  • Updated corpora (processed with pipeline 2.0) are available for download.

Editing Programmatically; or, Curating ‘Big Data’ Literature Corpora

No one has time to read and really understand all of the 1,476,894,257 words that comprise the Text Creation Partnership (TCP) digital texts. Considering that adults read on average 300 words a minute, it would take someone about 40 years to read every word in the TCP’s 61,000 texts. That 40-year estimate assumes 52 40-hour work weeks per year—no vacations, no holiday time, no sick time, no lunches or breaks.

How, then, does one editorially intervene in literature datasets at such scale? This task isn’t best carried out by more traditional methods of editing, where a human reader scrutinizes each text word by word and makes local changes. In this post, I will describe the research process I used to create VEP’s dictionary for standardizing the early modern spelling variation captured in TCP texts.

The goal of spelling standardization is to map variant spellings to a standard spelling, like “neuer” and “ne’er” to “never”. This standardization reduces noise in the dataset, providing analytical gains. It makes statistical analysis more accurate for users interested in counting and weighing textual features, such as word frequencies, and it supports tasks like part-of-speech parsing.
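
Conceptually, the standardization step is a lookup: every token is checked against a dictionary of variant-to-standard mappings. Here is a minimal Python sketch of that idea; the tiny mapping below is illustrative only, not the actual VEP dictionary, and a real implementation would need proper tokenization around punctuation.

    # Illustrative variant-to-standard mapping (not the real VEP dictionary).
    STANDARD = {"neuer": "never", "ne'er": "never", "iewes": "jews", "vs": "us"}

    def standardize(text, mapping=STANDARD):
        # Replace each whitespace-separated token with its standard spelling
        # when the dictionary has an entry for it; otherwise keep it as-is.
        return " ".join(mapping.get(token.lower(), token) for token in text.split())

    print(standardize("the Iewes shall neuer trust vs"))  # -> the jews shall never trust us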

To ensure spelling consistency across the dataset, I researched the most frequent original spellings in the TCP texts. The team in Wisconsin decided to aim for a 95% standardization guarantee. Meeting that guarantee efficiently meant researching the most frequent words first. As a result, I examined the behavior of 26,700 original spellings that occurred 2,000 or more times in the TCP. Their frequencies accounted for 95.04% of the total number of words in the TCP.
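
As a rough illustration of how a frequency cutoff translates into corpus coverage, the sketch below computes what share of all tokens is accounted for by spellings at or above a threshold. The `freqs` counter is a hypothetical stand-in for frequency counts gathered over the full TCP.

    from collections import Counter

    def coverage_at_cutoff(freqs: Counter, cutoff: int = 2000) -> float:
        # Fraction of all tokens accounted for by spellings occurring >= cutoff times.
        total = sum(freqs.values())
        covered = sum(count for count in freqs.values() if count >= cutoff)
        return covered / total if total else 0.0

    # freqs would be built by counting tokens across all 61,000 TCP texts;
    # for VEP, the 2,000-or-more cutoff covered roughly 95% of all tokens.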

I searched for spellings in the TCP using the command line. (Using UNIX utilities is preferable to loading the 61,000 TCP texts into concordance software, since that is a heavy task for GUI-aided text processing.)

[Screenshot: spelling standardization research at the command line]

I examined thousands and thousands of instances of original spellings in brief context. It was an exercise in brevity, a trade-off between time and human labor. More often than not, the searches returned enough text surrounding the original spelling to understand its meaning. (For example, look at the screenshot of my research above: it is obvious when the original spelling “peeres” means “peers” and when it means “pears”.) It would have been too time-consuming to open the TCP text files and read a paragraph of context for each original spelling; at that pace, you could hardly make a judgment call on a single word in a day.
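
The command-line searches returned each hit of a spelling with a little surrounding context. A rough Python stand-in for that kind of brief-context (KWIC) search is sketched below; the folder name and context window are assumptions, not the actual setup.

    import glob
    import re

    def kwic(spelling, folder="tcp_texts", window=40):
        # Show each occurrence of `spelling` with up to `window` characters of
        # context on either side, one line per hit.
        pattern = re.compile(
            r".{0,%d}\b%s\b.{0,%d}" % (window, re.escape(spelling), window),
            re.IGNORECASE,
        )
        for path in sorted(glob.glob(f"{folder}/*.txt")):
            with open(path, encoding="utf-8") as handle:
                for match in pattern.finditer(handle.read()):
                    print(path, "|", match.group(0).replace("\n", " "))

    # kwic("peeres")  # usually enough context to tell "peers" from "pears"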

I made the following decisions about original spellings:

  • not to standardize an original spelling because it was already in what we recognize as a standard form (e.g., “the”)
  • not to standardize an original spelling because standardizing it would have introduced too much error into the dataset
  • to standardize an original spelling to the spelling that was correct most of the time, based on the behavior of the spelling across the entire TCP

Standardizing the most frequent original spellings resulted in major payoffs. To illustrate, compare the corpus frequency and rank in corpus for the word “Jews” in the two tables below.

1-Gram in the TCP (Original Spelling)

  n-gram    corpus frequency    rank in corpus
  jews      154,027             849

1-Gram in the TCP (Standardized Spelling)

  n-gram    corpus frequency    rank in corpus
  jews      315,702             458

Standardization located 161,678 more instances of the word “Jews” in the TCP, which has vast implications for those who study religion in early modern texts. Standardization yielded a 104% gain in recognition. For the curious, here are original spellings that are standardized to “Jews” in the VEP dictionary: ievves, ievvs, iewes, iews, jevves, jevvs, and jewes.

Data standardization is a form of editorial intervention that can have vast impacts for users. Granted, the provided standardization is a first step, and I invite others to expand upon my work. However, I argue that enforcing spelling consistency in the TCP corpora provides users with a cleaner, more accessible dataset. The standardized spelling makes the dataset easier to search. Users do not need to be experts in Early Modern English spelling conventions to extract meaningful information.

Curious to see what my editorial intervention looks like? I annotated Act 1 Scene 1 from Pericles. The annotations are provided in an interactive Text Viewer. (An image of the Text Viewer is directly below, and the link to access it is at the end of this entry.)

[Screenshot: annotated Pericles Act 1 Scene 1 in the Text Viewer]

We are providing an annotated SimpleText example of Act 1 Scene 1 from Pericles. (A preview of the annotations is in the image above.) The text is annotated with three tags: Standardized, Researched, and Justified. Standardized highlights words that have had their spelling standardized. If a standardized spelling is incorrect in the Pericles scene, the annotation explains why within the context of the TCP corpus. Researched highlights words that were researched in the process of compiling the standardization dictionary but were not given standardized spellings, either because 1) they were already in a recognized form or 2) the original spelling had too many different meanings to be standardized. Justified highlights words that were not standardized and provides a reason why (e.g., the word’s frequency was lower than 2,000).

View Annotations for Standardized Spelling in Pericles Act 1 Scene 1

Presenting at Yale’s Digital Humanities Lab

VEP’s own Heather Froehlich recently presented at the Yale Digital Humanities Lab! Here is what Heather has to say about her presentations:

On the kind invitation of Cathy DeRose, an alumna of Visualising English Print, I was a visitor at the Yale Digital Humanities Lab last week. While there I gave two presentations: one, a paper about some of my research involving EEBO-TCP, and the other, a three-hour masterclass on ways of using and accessing EEBO-TCP Phase I. It was a real pleasure to spend a few days with the very keen members of the digital humanities community at Yale.

In the workshop, we primarily discussed what makes EEBO-TCP’s many entrypoints different to the Early English Books Online images, in addition to best practices for accessing and using EEBO-TCP. My main goal was to highlight the fact that sure, you can download all the texts yourself, clean them up yourself, and then start the research process… or you can take advantage of a lot of hard work that others have done and start conducting your research without the stress of doing it all yourself. First I introduced the difference between the TCP transcriptions and the Chadwyck-Healey images, familiarising participants with the online repository of EEBO-TCP transcriptions and exploring their relationship to the searchable features users are already used to interacting with on the Chadwyck-Healey search interface.

We also discussed and practiced using several front ends, including the CQPweb interface for EEBO-TCP and the BYU corpora’s incomplete version to identify potential variant spellings, as well as the Early Print Ngram viewer for EEBO-TCP Phase I to trace the most frequently used variants and concepts. We also discussed the benefits of historical resources such as the Historical Thesaurus of the Oxford English Dictionary; these are all resources that I and other members of Visualising English Print have used in our research.

Finally, to tie it all together, I also presented several case studies based on work done by my colleagues at the University of Strathclyde. The Super Science Corpus, something Strathclyde RA Alan Hogarth has been working on for the better part of a year, represents the world of early modern scientific writing included in the Phase I release of the EEBO-TCP texts. With his help I was able to give some preliminary results about the relationship between philosophy of science and other scientific writing between 1482 and 1710. We also discussed work by Shota Kikuchi, a visiting scholar at Strathclyde from the University of Tokyo, which seeks to improve part-of-speech tagging accuracy by further modernising archaic constructions after the initial VARD process outlined here. For example, by modernising tis to it is, a part-of-speech tagger’s accuracy improves enough to make syntactic analysis more viable. And, last but certainly not least, I spoke about some of undergraduate student Rebecca Russell’s work with Jonathan Hope in the interdisciplinary Textlab course on the language of Shakespeare’s plays, showing the potential for students to use these kinds of resources in a pedagogical context.

View presentation slides for Heather’s talk, ‘Things You Can Do with EEBO-TCP Phase I,’ and a supplementary handout.

View Heather’s post about her time at the Yale Digital Humanities Lab.

Shakespeare Association of America Annual Meeting 2016

Visualizing English Print is at the Shakespeare Association of America Annual Meeting in New Orleans! Jonathan Hope, Alan Hogarth, and I will be part of the digital exhibits on Thursday from 10:00 AM to 1:30 PM. Be sure to stop by! We’ll be offering demonstrations of new tools like TextDNA and of our early modern drama and early modern science corpora.

XML Tags in TCP TEI-P4 Files

Do you work with the TEI P4 versions of TCP XML files and wonder what all those tags mean? After surveying XML tags in TCP corpora, I made a spreadsheet that lists all of the tags, defines them, and mentions where you may find said tags within the XML documents.

Download the spreadsheet from here.

Surveying and examining the files made it obvious that different TCP corpora have different levels of curation. The EEBO-TCP corpus has the most fleshed-out metadata. Not all tags are used across all corpora: according to my survey, Evans-TCP doesn’t use <FILEDESC> tags the way EEBO-TCP and ECCO-TCP do. Also, EEBO-TCP has the following tags that ECCO-TCP and Evans-TCP don’t: <AB>, <DEL>, <FW>, and <SUBST>.

Knowing all of the tags and what they are used for has been important for VEP’s methods, since we’re writing a script that grants users flexibility in extracting text from TCP TEI P4 XML files. It uses a configuration file that indicates which text between XML tags to extract and which to ignore. For example, if you want to extract plays without their stage directions, the configuration file allows you to ignore the <STAGE> tags that contain them.
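
To give a sense of how such a configuration-driven extraction might work, here is a minimal Python sketch (not VEP’s actual script) that drops the contents of configured tags before pulling out the text. The tag set is a stand-in for a config file, and real TCP TEI-P4 files may also need their DTD entities resolved (for example with lxml) before they will parse.

    import xml.etree.ElementTree as ET

    IGNORE_TAGS = {"STAGE"}  # hypothetical config: drop stage directions

    def extract_text(path, ignore=IGNORE_TAGS):
        root = ET.parse(path).getroot()
        # Remove ignored elements (and everything nested inside them).
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag in ignore:
                    parent.remove(child)
        # Concatenate whatever text remains.
        return " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())

    # extract_text("A00001.P4.xml")  # a play's text without its stage directions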

For the spreadsheet I made, I obtained TAG definitions from TEI: Text Encoding Initiative.

Forcing Standardization in VARD, Part 2

The final aspect of standardization I will discuss is common early modern spellings forced to their modern equivalents: decisions where the payoff of consistency outweighs slight data loss.

The VEP team decided to force bee > be, doe > do, and wee > we.

Naturally, one can see the problems inherent in these forced standardizations.

In early modern spelling, bee can stand for the insect as well as the verb; doe can signify the deer or the verb; and wee can be either the adjective or the pronoun. We hypothesized that in our drama corpus 1) bee would overwhelmingly be the verb, 2) doe would overwhelmingly be the verb, and 3) wee would overwhelmingly be the pronoun.

The decision to force these words was supported by sampling the frequency of meanings in the early modern drama corpus, along with frequencies from Anupam Basu’s EEBO-TCP Key Words in Context tool (set to original spelling), offered by Early Modern Print: Text Mining Early Printed English.

My method to determine meaning frequency is as follows:

  1. I searched for the first 1,000 instances of a spelling in the early modern drama corpus and in Key Words in Context.
  2. I generated CSVs of those 1,000 hits, including surrounding text, to gain context and determine each word’s signification.
  3. When I located a word that deviated from the meaning VEP projected the spelling would carry, I highlighted the entry and took notes in a column beside the line.
  4. After I read through the 1,000 instances of a spelling, I tallied the number of times the word did not match our hypothesized meaning (a small sketch of this tallying step follows the list).
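
As referenced in step 4, here is a minimal sketch of the tallying step, assuming a hypothetical CSV of 1,000 KWIC hits with a "note" column that is non-empty wherever a hit deviated from the hypothesized meaning.

    import csv

    def error_rate(path):
        # Share of hits whose "note" column is non-empty, i.e. flagged deviations.
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        mismatches = sum(1 for row in rows if (row.get("note") or "").strip())
        return mismatches / len(rows) if rows else 0.0

    # error_rate("bee_drama_hits.csv")  # e.g. 17 flagged rows out of 1,000 -> 0.017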

BEE > BE

  CORPUS      INSTANCES OF INSECT   INSTANCES OF SPELLING   PERCENTAGE OF ERROR
  EM Drama    17                    1,000                   1.7%
  Key Words   71                    1,000                   7.1%

Bee as bee is more frequent in the first 1,000 hits of Key Words for reasons of genre. Key Words contains all of EEBO-TCP, which includes early dictionaries (Thomas Elyot) and husbandry texts (John Fitzherbert). Moreover, compilers like George Gascoigne recognized the metaphorical power of the bee’s work–travelling from flower to flower to make sweet honey–and used it as meta-commentary on their labor of gathering the most delightful and edifying writing.

DOE > DO

  CORPUS      INSTANCES OF ANIMAL   INSTANCES OF SPELLING   PERCENTAGE OF ERROR
  EM Drama    0                     1,000                   0%
  Key Words   1                     1,000                   .1%

I further looked into variant spellings of the conjugation does in the drama corpus, to see how common the animal would be as opposed to the verb. Searching for does in the corpus yielded one instance of the animal in the first 1,000 instances of the spelling (.1%). Searching doe’s in the corpus yielded 158 instances of the spelling, all of which were the verb.

The above results suggest minimal data loss for standardizing all instances of doe to do in the drama corpus.

WEE > WE
It is harder to pin down figures for this decision.

Searching for wee in the early modern drama corpus, I identified 4 of the first 1,000 instances that were not the pronoun. One looked like it should have been well, another looked like an elision of God be with yee (God b’wee). The remaining two instances were French, which, standardized, would be oui.

Based on the first 1,000 instances of wee in Key Words in Context, there was too much noise. It seems that the text you search in Key Words in Context doesn’t preserve the TCP notation for illegible characters, the bullet (•). There were many places where I had to look at the original TCP files to determine the signification of wee, because neither the pronoun we nor the adjective wee made sense. When consulting the files, I matched wee to words with illegible characters (e.g., we•e).

What do these standardizations mean for the drama corpus?
If you work on bee and deer imagery in early modern drama, you will want to look somewhere else. For the bee example, if the 17 in 1,000 instances of the spelling bee as insect holds steady over the 6,694 instances of bee in the drama corpus, that means ~113 of those 6,694 spellings of bee refer to the insect. Overall, with an error rate of 1.7%, data loss in the corpus is minimal when the spelling bee is forced to be.

Granted, I looked at only the first 1,000 instances of spellings in the corpus and in Key Words, so I reviewed non-comparable portions of these corpora. The VEP team decided the sampling was nonetheless telling for the context of the drama corpus. Another inconsistency is the order in which the files were searched in Key Words versus the drama corpus. Key Words doesn’t provide the user with options for ordering results, so the words are displayed in chronological order; for the drama corpus, files were searched from smallest to largest TCP file number. Overall, the frequency of significations suggests small margins of error for the standardizations of bee, doe, and wee within the corpus.

Forcing Standardization in VARD, Part 1

Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered what words and patterns VARD should change but wasn’t changing. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving early modern verb endings (-st, -th) would require an overhaul of VARD’s dictionary.

There were three routes I followed to force standardization: manually selecting variants over others to change confidence scores; marking non-variants as variants and inputting their standardized form; adding words to the dictionary.

For the early modern drama corpus, the VEP team identified two kinds of changes for forced standardization: consistent spelling for pronouns, adverbs, and prepositions; and expansion of elisions that would interfere with algorithmic analysis, like topic modeling. Granted, more could have been changed, but we erred on the side of caution to see how effective these changes would be overall.

Below I document the forced changes; I will discuss their implications for the dataset in the next entry.

RULES TO FORCE ELISION EXPANSION (read more here)
t’ to_ Start
th’ the_ Start

PRONOUNS AND CONTRACTIONS
hee > he
hir > her
ide > I’d
ile > I’ll
i’le > I’ll
shee’s > she’s
shees > she’s
* wee > we

ADVERB CONTRACTIONS
heeres > here’s
heere’s > here’s
theres > there’s
ther’s > there’s
wheres > where’s
wher’s > where’s

ADVERBS/PREPOSITIONS
aboue > above
ne’er > never
ne’re > never
nev’r > never
o’er > over
oe’r > over
ope > open
op’n > open

WORDS ADDED TO DICTIONARY: Cupid, Damon, Leander, Mathias, nunc, Paul’s, Piso, qui, quod, tis, twas, twere, twould

MARKED AS VARIANTS FOR CORRECTION: greene > green, lockes > locks, vs > us, wilde > wild

* I will discuss the implications of our decision for wee in the next entry.

VARD Normalization Errors

VARD decently standardizes Early Modern English. Sometimes, though, it makes questionable replacements.

  ORIGINAL      NORMALIZATION   SHOULD BE
  all’s         ell’s           all’s
  caus’d        cause           caused
  Cicilia       Cicely          Cicilia
  courtesie     curtsy          courtesy
  diuers        divers          diverse
  hir           his             her
  ile           isle            I’ll
  ist           first           is’t
  kild, killd   kilt            killed
  maister       moister         master
  maist         moist           mayst
  *nunc         nuns            nunc
  Pauls         Pals            Paul’s
  *qui          queen           qui
  shees         shoes           she’s / she is
  weele         weal            we’ll / we will
  where’s       whores          where’s / where is

Of course, you will want to check how your VARD installation handles these words. VARD keeps a running list of the changes it makes, which silently trains the program as it executes, so it is good practice to examine what VARD changes certain words to. Your datasets will be different from mine: these changes are based on blatant errors the VEP team located in the early modern drama corpus, and datasets have different curation needs depending on their content.

*The dataset we use is interspersed with different languages, especially Latin. I had to add foreign words, like nunc and qui, to the dictionary to prevent VARD from skewing the frequency of certain vocabulary.