Guest Post: Data-Mining King Lear

[I am pleased to offer this guest post by Darby Foster, a first year undergraduate student at Georgia Institute of Technology, majoring in Business Administration/Information Technology Management. Her professor, Dr Sarah Higanbotham, was kind enough to get in touch with me to share Darby’s final paper, which appears in a truncated form here. VEP loves hearing from students whose imaginations have been really taken by the work we do. –hgf]

Darby Foster
Georgia Institute of Technology

First Folio, Emory University, Nov. 2016

As a Business Administration/IT Management major, I was not overly eager to take an English Literature course, and especially not a Shakespeare course focusing on the 1623 First Folio. And yet I have never been (and perhaps will never be again) so excited about research as I was when I applied data mining to Shakespeare’s late tragedy, King Lear. It began with Michael Witmore’s podcast on data-mining Shakespeare, which inspired me to experiment with data-mining: first with Hamlet, using Voyant, an online corpus analysis tool, to isolate word trends in Hamlet’s soliloquies. In particular, I traced relative word frequencies and found a predominance of comparisons (16 uses of the preposition “like”). When I read King Lear, I became even more curious about that play’s language. The corpus analysis software Ubiqu+ity allowed me to analyze King Lear quantitatively in terms of the play’s tragedy, trying to gain perspective on just how sad the play really is. My analysis provided substantial evidence against the claims of the literary critic George Steiner about Shakespeare and the genre of tragedy.

Most people define genre by overall narrative structure. To a traditional close reader, genre is “a type of literary work characterized by a particular form, style, or purpose” (“Genre”). But to a computer, “genre is a coordinated set of having things and not having things” (Witmore 2011). Data-mining software takes texts or selections of text and counts the occurrences of specific words and phrases. Certain words play a key role in tragic drama in particular, including doubt, sense, nature, and fortune (Booth 1983, 37). DocuScope’s dictionary categorizes thousands of words into “Positivity,” “Negativity,” “Anger,” “Sad,” and so on. By tracking the individual words in each category, I found it surprisingly easy to discover a play’s genre.
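To make that counting concrete, here is a minimal sketch of the word-level tallying such software performs. The two category lists and the filename are invented for illustration: DocuScope’s real dictionary holds many thousands of words per category and also matches multi-word phrases.

```python
from collections import Counter
import re

# Toy category lists; the real DocuScope dictionary is vastly larger.
CATEGORIES = {
    "Negativity": {"death", "curse", "torturous", "doubt"},
    "Positivity": {"trust", "blessing", "hope"},
}

def category_counts(text):
    """Tally how many tokens of `text` fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    return counts, len(tokens)

counts, total = category_counts(open("king_lear.txt").read())
for category, n in counts.items():
    print(f"{category}: {n} hits ({n / total:.2%} of all tokens)")
```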

Hope and Witmore 2004, 2010

Shakespeare’s 1623 First Folio divides the plays according to genre: comedies, histories, and tragedies. While the compilers of this collection likely used plot to separate the plays into genres, the same separation can be done using data-mining (Witmore 2011). Unfortunately, at this close level of analysis, the genre of tragedy can be difficult to distinguish. Data-mining software can easily distinguish comedies from histories, but tragedies lie somewhere in between these two genres (Hope and Witmore 2004). DocuScope, a sophisticated data-mining tool, counts the occurrences of specific categories of words and phrases in sections of text and creates graphs to display the findings visually. The following graph is a scatterplot of 1,000-word pieces of all of Shakespeare’s plays, color-coded by genre (Witmore 2011). Green dots represent histories, red dots represent comedies, orange dots correspond to tragedies, and blue dots represent the late plays. The graph shows that what histories have, comedies lack, and vice versa, while tragedies sit between these two more defined genres. The patterns in the graph demonstrate that, in addition to sharing plot structures and characters, Shakespeare’s plays within the same genre were written with the same language and style.
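A scatterplot of this kind can be approximated in a few lines of Python. The sketch below substitutes random numbers for the real DocuScope category percentages of each 1,000-word chunk, then uses Principal Component Analysis, one standard way of projecting many category counts down to two plottable axes.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Stand-in data: one row of category percentages per 1,000-word chunk,
# plus the genre of the play each chunk came from.
rng = np.random.default_rng(0)
features = rng.random((300, 20))          # placeholder for real DocuScope scores
genres = rng.choice(["comedy", "history", "tragedy"], size=300)

# Project the high-dimensional category space down to two axes.
points = PCA(n_components=2).fit_transform(features)

colors = {"comedy": "red", "history": "green", "tragedy": "orange"}
for genre, color in colors.items():
    mask = genres == genre
    plt.scatter(points[mask, 0], points[mask, 1], c=color, label=genre, s=10)
plt.legend()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```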

One of Shakespeare’s most famous tragedies, King Lear, produces fascinating results when data-mined. DocuScope breaks up the text into over 100 categories of words. Each category contains thousands of words that were individually selected by David Kaufer, an English professor at Carnegie Mellon University, to fit a specific idea. One of the most prominent categories in the text of King Lear is “Negativity.” This category contains words such as death, curse, and torturous and corresponds to a total of 798 individual instances of negativity throughout the play (Ishizaki and Kaufer 2012). Such a strong presence of a single emotion greatly influences a work of literature. In this case, it also plays a big role in determining the genre of the play. Data-mining this play clearly reveals the play’s tragic nature.

Anyone who experiences King Lear can likewise tell that the play is a tragedy. From act one, scene one, it is evident that things are going downhill, as the king reveals his “darker purpose” to divide his kingdom into three parts, one for each of his daughters, so they can rule while he takes an “unburdened crawl toward death” (Shakespeare 1997, 1.1.43). From this point forward, the play is filled with pessimism, tragic events, and nihilism. Some argue that the decision to divide the kingdom is the true climax of the story, breaking the mold of traditional Shakespearean tragedies (Bowers 1980, 13). This structure allows no time for introducing the classic narrative fall of Lear; it brings the audience right into the middle of the story, which quickly becomes tragic. The two most loving and loyal characters in the play, Cordelia and Kent, are quickly banished. Not long after, Lear himself is banished from the homes of his daughters and sent out into a terrible storm (Shakespeare 1997, 2.4.295-353). The play becomes less tolerable to the audience as Lear’s mental capacity deteriorates. Rather than the tragedy building slowly over five acts, the audience experiences King Lear’s fall from 1.1. As the play progresses, there is still hope that conflict will be resolved and the protagonist will live on, but Shakespeare refuses to fulfil the desires of his audience (Booth 1983, 17). Cordelia’s death shocks everyone. “Enter Lear, with Cordelia in his arms, and the most terrifying five minutes in literature have begun” (Booth 1983, 11). The play ends, not with poetic justice, but with a father carrying the body of the virtuous young daughter whom he misjudged. And to intensify the tragedy, Lear himself dies just minutes later.

A quantitative perspective on King Lear provides similar results. When graphing the relative frequencies of specific types of language, patterns emerge in the data. An interesting example is “Positivity,” which contains words and phrases such as trust, blessing, and hope. For example, “I pray you, sir, take patience: I have hope” (Shakespeare 1997, 2.4.130). While overall levels of negativity decrease as the play progresses, so do levels of positivity, which are almost always lower than the levels of negativity.

In the graph above, “Negativity” is represented in red and “Positivity” in blue, over time. The diminishing positivity can be attributed to the nature of tragedy: as more and more tragic events occur, the scenes and characters are filled with less positivity. This increasing level of tragedy correlates with a steadily increasing level of overall sadness. While there are peaks and troughs in the graph of words categorized as “Sad,” the linear regression line shows an overall increase in sadness as the play goes on. This reflects the emotions of the characters in the play as well as the mood inflicted upon the audience during the tragedy. Language categorized as “Anger” follows a similar pattern, increasing relatively as the play progresses. In this overlay of the two graphs, with the DocuScope categories “Anger” in red and “Sad” in blue, note that the major peaks in both categories of words even somewhat align. These two emotions, anger and sadness, are clearly correlated in this play. Both are typically thought of as negative emotions, which are common in tragedies. When tragic events occur, natural responses often include sadness over what happened and anger that it did happen. In King Lear, characters often experience anger, sadness, or both in response to the events of their lives.
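The trend line described here is a simple linear regression fitted over windowed counts. Here is a sketch, with invented per-window frequencies standing in for the real “Sad” scores:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented relative frequencies of "Sad" words per 1,000-word window.
sad = np.array([0.8, 1.1, 0.9, 1.4, 1.2, 1.7, 1.5, 2.0, 1.8, 2.3])
x = np.arange(len(sad))

# Fit a straight line to expose the overall trend despite local peaks.
slope, intercept = np.polyfit(x, sad, 1)

plt.plot(x, sad, "b-", label="Sad (per window)")
plt.plot(x, slope * x + intercept, "k--", label=f"trend (slope={slope:.2f})")
plt.xlabel("1,000-word window")
plt.ylabel("% of tokens")
plt.legend()
plt.show()
```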

Lear is Shakespeare’s most tragic play. It is possibly even “the most devastating tragic apprehension in the whole of Western dramatic literature” (Jackson 1966, 26). As Stephen Booth summarizes, “watching Lear is not unlike waiting for the death of a dying friend; our eagerness for the end makes the friend no less dear” (Booth 1983, 17). This very specific feeling captures the experience of King Lear; it is so depressingly tragic that all the audience wants is for the misery of the play to end. This type of incredibly sad tragedy has its own name: absolute tragedy. Absolute tragedy “is immune to hope” (Steiner 2004, 4). It leaves no opportunity for the audience to believe that something good will come from all the negativity; it is unquestionably tragic. Such absolute tragedy “presents men and women who the gods torture and kill ‘for their sport’” (Steiner 2004, 11). This action is directly referenced in King Lear, when Gloucester recognizes late in the play, “As flies to wanton boys are we to th’ gods. They kill us for their sport” (Shakespeare 1997, 4.1.41-42). By this measure, King Lear aligns seamlessly with the definition of absolute tragedy.

Steiner disagrees. According to him, Shakespeare’s only absolute, and therefore most tragic, tragedy is Timon of Athens (Steiner 2004, 12). He argues that Timon’s utterly bleak plot and motifs make this play more tragic than the rest. A scan through DocuScope provides contrary results. In categories that are critical to the genre of tragedy, King Lear dominates. The chart on the right shows the percentage of each play that fits into the DocuScope categories of “Negativity,” “Positivity,” “Anger,” and “Sad.” These values show that King Lear is approximately 1.09 times more negative, 1.59 times sadder, and 1.02 times angrier than Timon of Athens, which also happens to be 1.08 times more positive than King Lear. Based on these metrics, King Lear clearly contains higher concentrations of words that are typically found in tragedies. This quantitative analysis provides a more precise technique for determining absolute tragedy, revealing that Lear is not only an absolute tragedy, but even more tragic than Timon of Athens.

Works Cited
Booth, Stephen. (1983). King Lear, Macbeth, Indefinition, and Tragedy. New Haven: Yale University Press.

Bowers, Fredson. (1980). “The Structure of King Lear.” Shakespeare Quarterly 31 (1): 7-20.

“Genre, N.” (2014) OED Online. Oxford University Press. Accessed February 7, 2017. http://www.oed.com/view/Entry/77629.

Hope, Jonathan and Michael Witmore. (2010). “The Hundredth Psalm to the Tune of ‘Green Sleeves’: Digital Approaches to Shakespeare’s Language of Genre.” Shakespeare Quarterly 61 (3): 357-90.

Hope, Jonathan, and Michael Witmore. (2004). “The Very Large Textual Object: A Prosthetic Reading of Shakespeare.” Early Modern Literary Studies 9 (12). Available online: purl.oclc.org/emls/09-3/hopewhit.htm.

Ishizaki, Suguru, and David Kaufer. (2012). DocuScope Dictionary. Accessed 7 November 2016. Available online: github.com/docuscope/DocuScope-Dictionary-June-26-2012.

Jackson, Ester Merle. (1966). “King Lear: The Grammar of Tragedy.” Shakespeare Quarterly 17 (1): 25-40.

Shakespeare, William. (1997). King Lear. Ed. R.A. Foakes. London: Arden Shakespeare. Available online: http://shakespeare.mit.edu/lear/.

Steiner, George. (2004). “’Tragedy,’ Reconsidered.” New Literary History 35 (1): 1-15.

Witmore, Michael. (2011). Data-Mining Shakespeare. Accessed 7 September 2016. Available online: https://youtu.be/W1RsgUqFEeY.

Using the metadata builder to guide an analysis

As we’ve been releasing new resources for interacting with the TCP files, one of the questions that keeps coming up is “This is great, but what are we supposed to do with this stuff?” In this blog post I’m going to show how you can use the Core 1660 Drama corpus (from our Early Modern Drama collection) and the Metadata Builder to look at plays which didn’t explicitly involve Shakespeare as an author. I also wanted the plays to cover a range of genre classifications (by any measure of genre) and to be a manageable size.

Using the Metadata Builder, I have the option to collect a variety of metadata from the master spreadsheet associated with the Core Drama corpus. As I want to study the texts freely available as part of the TCP, I select the ‘Unrestricted’ option in Step 1 rather than ‘All’. In this particular case, I am interested in play companies, so I want to ensure I get metadata which will supplement and guide my analysis of them. Therefore, I select the following categories in Step 2: TCP, ESTC, Wiggins Number, Author 1, Authors 2-5, Title, Genre, Wiggins Genre, DEEP Genre, Harbage Genre, Wiggins Contemporary Genre, Date of Writing, Date of first performance, Play Company 1, Play Company 2, and Theatre.[1] I could have downloaded more metadata, but these categories seemed most suited to guide an analysis of one particular play company. Looking at the metadata spreadsheet and paying specific attention to the Play Company 1 category, I settled on the Admiral’s Men, as the group includes a diverse range of authors (including Munday, Dekker, Marlowe, Chapman and Peele) while remaining a manageable size (21 plays).
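This kind of filtering can also be scripted. Here is a minimal pandas sketch, assuming the downloaded spreadsheet has been saved as a CSV; the filename and the exact spelling of the company value are assumptions to check against your own download.

```python
import pandas as pd

# Hypothetical filename; use whatever your Metadata Builder download is called.
meta = pd.read_csv("core_drama_1660_metadata.csv")

# Keep only plays whose first company attribution is the Admiral's Men.
admirals = meta[meta["Play Company 1"] == "Admiral's Men"]
print(len(admirals), "plays")
print(admirals[["TCP", "Title", "Author 1"]].to_string(index=False))
```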

I then isolated the specific TCPIDs associated with each play-text belonging to the group I will now call ‘Admiral’s Men Plays’. Armed with this list, I copied these plays into a new folder to create a subcorpus of plays from the Core 1660 Drama Corpus. Here’s what that looked like:

[Screenshot: the new folder of Admiral’s Men play-text files]
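If you would rather script the copying than drag files by hand, something like the following works, assuming the plays live in one folder as plain-text files named by TCPID (both assumptions) and reusing the `admirals` table from the sketch above:

```python
import shutil
from pathlib import Path

src = Path("Core_Drama_1660")      # hypothetical source folder
dst = Path("Admirals_Men_Plays")   # the new subcorpus folder
dst.mkdir(exist_ok=True)

for tcpid in admirals["TCP"]:      # `admirals` from the previous sketch
    shutil.copy(src / f"{tcpid}.txt", dst / f"{tcpid}.txt")
```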

Having made decisions about what texts to analyse and moved the files around to create a corpus of Admiral’s Men Plays, I can now set up a multivariate linguistic analysis using Ubiqu+ity to observe some specific linguistic features. I’ve previously written about creating your own rules for Ubiqu+ity, but this time I want to use the standard DocuScope dictionary, a rich classification schema of the English language. While I may not necessarily agree with every decision behind the DocuScope categorization, it applies the same rules to every text it is given to analyze, which means that it counts the same features every time. Using the default settings on the Ubiqu+ity site, the system sends me a zipped folder of results. Included in this zipped folder is a comma-separated values spreadsheet which reports what percentage of each file falls into each linguistic category.

A selection of linguistic categories reported by the DocuScope dictionary

Due to the nature of how DocuScope categorises language, some linguistic groupings are more likely to be in use than others. For example, FirstPerson (I, me, etc) is more frequent than Apology (sorry, apologies, etc) due to the nature of how language is understood to be distributed: the small boring words like I and me are far more frequent than more contentful words like ‘sorry’ or ‘apologies’ (This is part of a phenomenon called Zipf’s Law and you can read more about it here). You may also have noticed that the filenames use the anonymized TCPID numbers; you can cross-reference for titles using the metadata spreadsheet.

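To see that skewed distribution for yourself, a few lines of Python will rank the words of any single text by frequency; the filename here is a placeholder for one of the TCPID-named files.

```python
from collections import Counter
import re

tokens = re.findall(r"[a-z']+", open("A01234.txt").read().lower())
freqs = Counter(tokens).most_common()

# Zipf's Law: frequency falls off roughly as 1/rank, so small function
# words dominate the top of the list.
for rank, (word, count) in enumerate(freqs[:10], start=1):
    print(rank, word, count)
```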

With this report, it is also possible to conduct a variety of quantitative analyses. Our colleagues have projected all of these categories into multidimensional space using Principal Component Analysis, but it is sometimes easier to focus on just a few features. By limiting the spreadsheet to only a handful of Language Action Types, it becomes far more manageable to work with. If I were interested in the category ‘Sad’, I could rank the spreadsheet using Excel’s sort function and see that Two Lamentable Tragedies has the highest quantity of ‘Sad’ in the subcorpus I have constructed. I can then see what other features are highly ranked for Two Lamentable Tragedies, or I can use the SlimTV TextViewer included in the downloaded Ubiqu+ity folder to identify other high-frequency linguistic categories for this play. With the SlimTV viewer, I can see that ‘Negativity’, ‘Intensity’, and ‘Standards-Positive’ are all highly ranked:

Negativity, Standards(Positive) and Intensity are all highly ranked in Two Lamentable Tragedies
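The same ranking can be done outside Excel. Here is a pandas sketch; the column names are assumptions based on the categories discussed above, so check them against the header row of your own results file.

```python
import pandas as pd

# The Ubiqu+ity results CSV: one row per file, one column per category.
results = pd.read_csv("ubiquity_results.csv")   # hypothetical filename

# Rank the subcorpus by the 'Sad' category, highest first.
by_sad = results.sort_values("Sad", ascending=False)
print(by_sad[["text_name", "Sad", "Negativity", "Intensity"]].head())
```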

And that’s just a jumping-off point: what other plays share this linguistic profile? From here I could compare how these specific features are used in other plays performed by the Admiral’s Men. But that’s still too broad a research question, so here are some more specific ones to follow up on: Do other plays in this subcorpus also rank high in those features? How do they compare to the majority of the plays in the Core Drama 1660 corpus? Does the prevalence of negativity and intensity correlate with the acting style or the material this group chose to perform?


[1] The metadata we have comes from several sources, including the Database of Early English Playbooks (DEEP) and Wiggins catalogues. We have also performed cross-checking between these two resources as well as including further reference to the ESTC and JISC Historic Texts, where necessary.

(this information has been taken in part from the VEP Core 1660 readme file [pdf])

TCP – lists the associated unique TCP ID number for the play in question

ESTC – Short Title Catalogue records, in case I need/want to find out more information about these texts

Wiggins Number – in case I want/need to reference the Wiggins catalogue for a particular play; based on Wiggins catalogues published to date

Author 1 – Primary author

Authors 2-5 – any other assigned authors, where applicable

Title – the title by which the play is commonly known – often the contemporary title.

Alternative Title – any other names the play could be known as. For example, the play known as “Volpone” is listed as “Volpone” for primary title, and “Volpone, or The Fox” is considered the secondary title. (We also use the ‘secondary title’ category to describe printed titles when they are different to performed titles.)

Genre – these are the genres originally assigned earlier in the project by Jonathan Hope – either Tragedy [TR], Tragicomedy [TC], Comedy [CO], History [HI], Masque [MA], Interlude [IN], Entertainment [EN], Dialogue [DI], or Non-Dramatic [ND]

Wiggins Genre – based on information from the Wiggins Catalogues (published to date)

DEEP genre – based on information from the Database of Early English Playbooks [link]

Harbage genre – from Harbage’s Annals of English Drama (1989)

Wiggins Contemporary Genre – genre classifications based on contemporary (modern) understandings of genre, taken from the Wiggins catalogues (based on those published thus far)

Date of writing – all texts have been given a date of writing. DEEP doesn’t have a date of writing column but sometimes offers a date range under ‘date of first production’, so in these instances the earliest date was taken for date of writing – if Wiggins offers a fixed date for date of writing then this was taken.

Date of first performance – when the play is understood to be first performed, if known

Play company 1 – The company of first production according to DEEP

Play company 2 – DEEP’s company attribution (where applicable)

Theatre – theatre and/or location of production, where available

Using the Metadata Builder: Getting the information that you want

Yesterday, Deidre wrote about the release of our new Metadata Builder, which collates lots of available information about materials included in the Text Creation Partnership transcriptions in one place. For each corpus available, you have the option of downloading metadata only for texts freely available in the public domain, or metadata for texts both freely available and presently restricted, to be made available in the public domain in 2020 (we can distribute information about these restricted-access texts, but we can’t share the files). As a user of the Metadata Builder, I want to be able to take advantage of all the different metadata options available to supplement and guide my analyses. In this post I’ll walk you through a few ways of obtaining different kinds of information using the various options on offer.

I happen to be interested in the language of dramatic writing. Visualizing English Print offers three different dramatic corpora: the Core Drama 1660 corpus, the Expanded Drama 1660 corpus, and the Expanded Drama 1700 Corpus. (Many scholars of Early Modern drama will be familiar with the Database of Early English Playbooks or DEEP as it is commonly known; this is in some ways quite similar). Of our dramatic corpora, the Expanded Drama 1700 corpus (ED1700) covers the largest quantity of dramatic writing, so I’ll use it as an example.

If I want all the metadata available for this corpus regardless of public-domain status, I would select ‘All’ available texts in Step 1. However, if I want to use this metadata to guide decisions about a project, I might prefer to use the ‘Unrestricted’ version of the corpus, as these texts are all freely available for download from our site.

First things first: to get all of our available metadata for either version of ED1700 specified in Step 1, select ‘all’ under every drop-down menu in Step 2. This is the “all you can eat” option: it will include every piece of metadata we have available, and from there you can download the spreadsheet and its associated readme file in Steps 4 and 5. With everything downloaded, you can always further refine your spreadsheet, but I find it useful to keep one master spreadsheet pristine and do metadata manipulations, such as organising by author, date or other parameters, in a second version of the original spreadsheet.

While it is great to have everything, sometimes it can be overwhelming. This post is therefore not meant to be a how-to guide but more of a ‘ways of thinking about the Metadata Builder’ guide. Here are a few of the metadata columns we offer which I personally find most useful.

If you want to get the dedicated TCP ID number associated with each transcription, you’ll want to select the category ‘TCP’ from the dropdown menu “Master Metadata”. These unique TCP identifiers match to a specific transcription: the TCP identification number A01234 will always link to that specific document. ESTC data (including Wing numbers, where applicable) is available under the option ‘ESTC’.

Under Master Metadata, we also offer information from the Wiggins Catalogues of British Drama, including their identification number schema and historical and contemporary generic assignments. Other useful generic information includes the DEEP genre and the Harbage genre, should you want to compare different understandings of genre over time or use various criteria to show variation in generic forms. I also often want to know how many words are in each text, as this is a common way of describing how big or long a text is. This can be found under the Ubiq categories; select ‘# word tokens’ at the very bottom of the Ubiq dropdown menu in Step 2.

The ability to group plays by company, based on information from the Wiggins Catalogues (using the options for Play Company 1 and Play Company 2 under Master Metadata in Step 2), means that I can easily organise an analysis using attributed information about working theatrical networks of the time and ask what makes the language of plays put on by the King’s Men different from, say, that of all other companies. With the option of including up to five authors as well, I can start to make more complex analyses using multiple axes, such as asking only about single-authored plays performed by Queen Henrietta Maria’s Men that are over 20,000 words long.
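That last query is just three boolean filters chained together. A pandas sketch, with the caveat that the filename is invented and the column names (taken from the categories above) should be checked against your own spreadsheet:

```python
import pandas as pd

meta = pd.read_csv("ed1700_metadata.csv")   # hypothetical filename

subset = meta[
    (meta["Play Company 1"] == "Queen Henrietta Maria's Men")
    & (meta["Authors 2-5"].isna())          # no co-authors listed
    & (meta["# word tokens"] > 20000)
]
print(subset[["Title", "Author 1", "# word tokens"]])
```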

What I am doing here is not limiting my corpus based on arbitrary features, but selecting texts which fit certain parameters to get at more specific questions. The more features I pull in, the more information I can base my decisions on, but not all of the categories in Step 2 may be immediately useful. For example, I probably don’t need to know if there are figures (images) in these texts or how many pages long the texts originally were: that’s probably not going to help me. By excluding them from my spreadsheet, I am able to focus on more relevant information (and if I decide I do want to know about it later, I can always get it from the all-metadata spreadsheet I downloaded in the first instance).

Another thing I can do with the Metadata Builder is download DocuScope tagging statistics for every text in a specified corpus using the dropdown menu ‘Ubiq’ in Step 2. This means that I do not have to process the ED1700 Unrestricted corpus through Ubiqu+ity myself, but can instead combine multiple metadata categories alongside the statistical distributions produced by the DocuScope text-tagging schema. By selecting relevant metadata categories such as author(s), date of first performance, theatrical group, and several views of genre assignment, I am setting myself up for quite a nuanced multivariate analysis using these particular features.
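With category scores and metadata in a single spreadsheet, grouped summaries become one-liners. A sketch, again with an invented filename and assumed column names:

```python
import pandas as pd

df = pd.read_csv("ed1700_with_ubiq.csv")    # hypothetical combined download

# Mean 'Sad' score per DEEP genre, highest first.
print(df.groupby("DEEP Genre")["Sad"].mean().sort_values(ascending=False))
```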

Finally, the multivariate analyses I suggested above do not necessarily require the use of further computational methods. The ability to isolate all the texts based on a certain principle can guide any number of decisions for studies which rely on close reading, such as identifying transcriptions which have multi-lingual content, or realising there is a text you didn’t know about that has a clear connection to your previous work. The Metadata Builder therefore makes it possible to obtain a lot of information about the huge number of texts now available as a result of the TCP project. We look forward to what you will do with it!

VEP’s Metadata Builder

VEP’s Metadata Builder helps users navigate our corpora collections, which are vast in scale. The Metadata Builder provides metadata to users in an accessible, intelligible format, preventing users from having to decipher spreadsheets with more than 120 columns and 160,000 rows. This blog post will explain the motivations for creating the Metadata Builder and its main components, enabling users to understand how to use the tool.

Motivations

This tool was built to allow users to generate and download metadata spreadsheets of VEP corpora tailored to their own research interests. Users can merge metadata from multiple spreadsheets into one, filter their results, and engage with texts. The Metadata Builder provides options for downloading metadata and README documentation generated for tailor-made spreadsheets.

The Metadata Builder’s functionality is guided by data management best practices, which have vast implications for scholarly research. The builder ensures that users have access to the most up-to-date information.

Main Components

The Metadata Builder has five sequential steps:

  1. Pick the Documents
  2. Pick Metadata Fields
  3. Examine the Metadata
  4. Save the Metadata
  5. Save the README

1. PICK THE DOCUMENTS
The first step requires users to select the spreadsheet based on the VEP corpus they are interested in working with. The corpus spreadsheets contain information about corpus texts and files. Hovering over a corpus in the dropdown menu activates a tooltip displaying a brief corpus description. This description is replicated in the Step 1 box once a corpus is selected.

[Screenshot: Metadata Builder Step 1]

For each corpus, VEP offers two versions of metadata spreadsheets: ‘Unrestricted Only’ and ‘All’. The difference comes down to licensing: agreements prevent VEP from releasing corpus text files made from restricted content in our source files, though we can share metadata about restricted files for research purposes. VEP provides users the option to download a spreadsheet of only the free files in a corpus (Unrestricted Only) or a spreadsheet of both free and restricted files (All). Unrestricted Only spreadsheets catalog the files available in our corpus downloads.

2. PICK THE METADATA FIELDS
This step of the Metadata Builder can be overwhelming for users not yet familiar with the data we provide. It is also the fun part, though, where users select the specific information they are interested in.

To understand this step, users need to know that the Metadata Builder pulls information from multiple existing spreadsheets into one spreadsheet specified by the user. From these spreadsheets, users specify the exact columns they want dynamically generated as their dataset. Metadata spreadsheets are listed in bold print. Beside each metadata spreadsheet name is a question mark; hovering over it displays a brief description of the spreadsheet’s contents. Users can also view a spreadsheet by clicking on the source link to the right of the question mark, and the spreadsheet will appear in a new window. Dropdown menus to the right of the metadata spreadsheet list contain the names of all available columns from each linked spreadsheet, and hovering over the column names displays explanatory tooltips.

[Screenshot: Metadata Builder Step 2]

Below is a list that explains the metadata type spreadsheets you may see in step two:

  • Master Metadata: This spreadsheet contains the corpus metadata provided by the curator, from text name and author to genre and number of pages in the text.
  • Text Links: This spreadsheet contains links to files of unrestricted corpus content hosted on the VEP server. You can click on the links to read corpus texts.
  • Ubiq Categories: This spreadsheet contains DocuScope LAT information for the files in the selected corpus.
  • TCP Metadata: This spreadsheet contains metadata for all TCP digital texts, provided by the TCP.
  • Non-English Language Metadata: This spreadsheet lists primary and secondary languages for texts in the TCP that have the least amount of recognizable English in them.
  • Figures-Per-Text Metadata: This spreadsheet lists the number of FIGURE XML tags that appear in each TCP XML text. It is useful for finding texts that contain tables and images.
  • Derived Date Metadata: This spreadsheet provides a programmatically selected date for all text in the TCP.

Users can select columns from as few as one metadata type spreadsheet or from as many as all of them. Once users select metadata columns, a ‘Build Metadata’ button appears in the section. Pressing the button generates the data table in the section for step three.

3. EXAMINE AND CONFIRM THE METADATA
This section renders a dataset based on the specifications entered in section two. It allows users to sort and search the metadata.

[Screenshot: Metadata Builder Step 3]

4. SAVE THE METADATA
This section allows users to save the generated datasets in a variety of formats. Users can save their datasets as Excel spreadsheets or CSVs. Users can optionally copy all of the information to their clipboard or print it.

[Screenshot: Metadata Builder Step 4]

5. SAVE THE README
This section contains a red button labeled ‘Download README.’ This step is important: the Metadata Builder dynamically generates a README file that explains the contents of your dataset.

[Screenshot: Metadata Builder Step 5]

Stay tuned for an upcoming blog post by Heather that walks users through what she finds to be the Metadata Builder’s most useful features!

VEP Releases

VEP has been busy improving its visualization tools and processing pipeline! You can read about all the changes in the list below.

Release Information

  • Text Processing Pipeline 2.0 features better Unicode handling during character cleaning and a dictionary that standardizes spelling variation across TCP corpora. Read about the pipeline on the ‘Workflow’ page. Download the pipeline from GitHub.
  • TextDNA is available for download! The download includes sample datasets and Python scripts for curating your own TextDNA datasets. Download it from GitHub.
  • Ubiqu+Ity 1.2 is officially released! The SlimTV (or Slim TextViewer) replaces Ubiqu+Ity HTML files for navigating tagged text.
  • Updated corpora (processed with pipeline 2.0) are available for download.

Editing Programmatically; or, Curating ‘Big Data’ Literature Corpora

No one has time to read and really understand all of the 1,476,894,257 words that comprise the Text Creation Partnership (TCP) digital texts. Considering that adults read on average 300 words a minute, it would take someone about 40 years to read every word in the TCP’s 61,000 texts. That 40-year estimate assumes 52 40-hour work weeks per year—no vacations, no holiday time, no sick time, no lunches or breaks.
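The arithmetic behind that estimate is easy to check:

```python
# Words divided by reading speed gives minutes; convert up to work-years.
words = 1_476_894_257
minutes = words / 300                # average adult reading speed (wpm)
hours = minutes / 60
years = hours / (52 * 40)            # 52 forty-hour work weeks per year
print(f"{years:.1f} years")          # -> about 39.5 years
```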

How, then, does one editorially intervene in literature datasets at such scale? This task isn’t best carried out by more traditional methods of editing, where a human reader scrutinizes each text word by word and makes local changes. In this post, I will describe the research process I used to create VEP’s dictionary for standardizing the early modern spelling variation captured in TCP texts.

The goal of spelling standardization is to map variant spellings to a standard spelling, like “neuer” and “ne’er” to “never”. This standardization reduces noise in the dataset, providing analytical gains. It makes statistical analysis more accurate for users interested in counting and weighing textual features, like word frequency and part of speech parsing.
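Mechanically, a standardization dictionary is just a mapping applied token by token. A minimal sketch, with a three-entry toy dictionary standing in for VEP’s tens of thousands of mappings:

```python
import re

# Toy excerpt of a variant-to-standard mapping.
STANDARD = {"neuer": "never", "ne'er": "never", "iewes": "jews"}

def standardize(text):
    """Replace each token with its standard spelling, if one is known.
    Case is flattened here for simplicity."""
    return re.sub(r"[a-z']+",
                  lambda m: STANDARD.get(m.group(), m.group()),
                  text.lower())

print(standardize("The Iewes were neuer so moved"))
# -> "the jews were never so moved"
```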

To ensure spelling consistency across the dataset, I researched the most frequent original spellings in the TCP texts. The team in Wisconsin decided to aim for a 95% standardization guarantee, which meant, to do the research efficiently, concentrating on the most frequent words. As a result, I examined the behavior of 26,700 original spellings that occurred 2,000 or more times in the TCP. Together, their frequencies accounted for 95.04% of the total number of words in the TCP.

I searched for spellings in the TCP using the command line. (Using UNIX is preferable to loading the 61,000 TCP texts into concordance software, as this is a heavy task for GUI-aided text processing.)

[Screenshot: spelling standardization research at the command line]

I examined thousands and thousands of instances of original spellings in brief context. It was an exercise in brevity, a trade-off between time and human labor. More often than not, the searches returned enough text surrounding the original spelling to understand meaning. (For example, look at the screenshot of my research above: it is obvious when the original spelling “peeres” means “peers” and when it means “pears”.) It would have been too time-consuming to open the TCP text files and read a paragraph of context for each original spelling; at that pace, you could hardly make a judgment call on a single word in a day.
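A keyword-in-context view of this kind is simple to reproduce in Python if you don’t have UNIX tools to hand. A sketch, with a hypothetical filename:

```python
import re

def kwic(path, target, width=30):
    """Print each occurrence of `target` with some context on either side."""
    text = open(path, encoding="utf-8").read().lower().replace("\n", " ")
    for m in re.finditer(r"\b%s\b" % re.escape(target), text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"{left:>{width}} [{target}] {right}")

kwic("A01234.txt", "peeres")   # hypothetical file
```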

I made the following decisions about original spellings:

  • not to standardize original spellings because they were in what we already recognize as a standard form (e.g., “the”)
  • not to standardize original spellings because they would have introduced too much error into the dataset
  • to standardize original spellings to a certain spelling that was correct most of the time based on the behavior of the spelling across the entire TCP

Standardizing the most frequent original spellings resulted in major payoffs. To illustrate, compare the corpus frequency and rank in corpus for the word “Jews” in the two tables below.

1-Gram in the TCP (Original Spelling)

| n-gram | corpus frequency | rank in corpus |
| jews   | 154,027          | 849            |

1-Gram in the TCP (Standardized Spelling)

| n-gram | corpus frequency | rank in corpus |
| jews   | 315,702          | 458            |

Standardization located 161,675 more instances of the word “Jews” in the TCP (315,702 against the original 154,027), which has vast implications for those who study religion in early modern texts. That is a gain of roughly 105% in recognition. For the curious, here are the original spellings that are standardized to “Jews” in the VEP dictionary: ievves, ievvs, iewes, iews, jevves, jevvs, and jewes.
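The gain comes purely from aggregation: every variant’s count is folded into the standard form. In the sketch below, only the count for the already-standard spelling “jews” and the combined total come from the tables above; the per-variant breakdown is invented so that the example adds up.

```python
from collections import Counter

# Counts per original spelling. Only "jews" (154,027) and the grand
# total (315,702) are real; the variant breakdown is invented.
spelling_counts = Counter({
    "jews": 154027, "iewes": 120000, "jewes": 30000, "ievves": 5000,
    "iews": 4000, "jevves": 1500, "jevvs": 800, "ievvs": 375,
})
VARIANTS = {"jews", "ievves", "ievvs", "iewes", "iews",
            "jevves", "jevvs", "jewes"}

standardized_total = sum(spelling_counts[v] for v in VARIANTS)
print(standardized_total)                                 # 315702
print(standardized_total / spelling_counts["jews"] - 1)   # ~1.05, a ~105% gain
```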

Data standardization is a form of editorial intervention that can have vast impacts for users. Granted, the provided standardization is a first step, and I invite others to expand upon my work. However, I argue that enforcing spelling consistency in the TCP corpora provides users with a cleaner, more accessible dataset. The standardized spelling makes the dataset easier to search. Users do not need to be experts in Early Modern English spelling conventions to extract meaningful information.

Curious to see what my editorial intervention looks like? I annotated Act 1 Scene 1 from Pericles. The annotations are provided in an interactive Text Viewer. (Image of the Text Viewer is directly below, and the link to access the Text Viewer is at the end of this entry.)

[Screenshot: annotations for standardized spelling in Pericles]

We are providing an annotated SimpleText example of Act 1 Scene 1 from Pericles. (A preview of the annotations appears in the image above.) The text is annotated with three tags: Standardized, Researched, and Justified. Standardized highlights words that have had their spelling standardized. If a standardized spelling is incorrect in the Pericles scene, the annotation will explain why within the context of the TCP corpus. Researched highlights words that were researched in the process of compiling the standardization dictionary. These words were not given standardized spellings because either 1) they were already in a recognized form or 2) the original spelling had too many different meanings to be standardized. Justified highlights words that were not standardized and provides a reason why (e.g., the word’s frequency was lower than 2,000).

View Annotations for Standardized Spelling in Pericles Act 1 Scene 1

The Untranscribable in EEBO

As part of Visualising English Print, I have been evaluating and validating judgments about non-English print in the Text Creation Partnership transcriptions of EEBO. I’ve been looking at texts which have been classified as non-English (or texts that appear to be non-English, such as lists of names or places) by an automated text tagger. Bi- or multi-lingual texts cause particular difficulties for this task, as a text can still be largely in English yet pose problems by containing a relatively high percentage of untaggable words. (Inconsistent orthography is another big difficulty here, which is why VEP is working on improving the machine-readability of the TCP texts.)

In the case of Early English Books Online, transcribers were given very specific instructions on how – and what – to transcribe. The TCP provides a map of every character available in Unicode. This page is extremely thorough, covering a huge range of language and symbol character sets including print symbols (❧, ☞, ⁂, ¶), alchemical symbols (♁, ♃, ℥, ☋), diacritics, and non-Latinate alphabetical symbols, including those associated with Greek, Hebrew, and Cyrillic. All of these characters therefore have the potential to be transcribed.

Characters which are considered part of the Classical Roman alphabet are retained, though there are a few exceptions, such as when the source image is obfuscated by heavy inking or damage to the page. The TCP guidelines also include an entire section devoted to foreign (that is, non-Roman) alphabets. The entire document is linked here, but I’ve replicated the important parts below.

  1. “Foreign” (non-Roman) alphabets. Extended text in a non-roman alphabet. Though individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities (see discussion of Characters, below), entire words or extended passages in a non-Roman alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC="foreign">, without transcribing the word(s) themselves. The tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC="foreign"></L></Q>; a paragraph in Greek as <P><GAP DESC="foreign"></P>; and a stanza in Greek as <LG><GAP DESC="foreign"></LG>.

    [Image: example of mixed Greek-English text] Record as: the semicircle .18.5, <GAP DESC="foreign"> .21.7, <GAP DESC="foreign"> .23

  2. The presence of musical notation should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “music”: <GAP DESC="music">.

    Extended spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) does not interrupt.

    Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC="music"> tag.
    Any mathematical formulas or mathematical notation too complicated (or too dependent on two-dimensional layout) to be rendered as plain text should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “math.”

  3. Illegible text, missing and damaged text, or clear but unrecognized symbols all will require some attention from us. Illegible text that cannot be read, for whatever reason, should be marked using variations on the “$” symbol:
    $ = individual character or characters, less than a word.
    $word$ = a whole word
    $span$ = any span of two or more words, less than a page.
    $page$ = a whole page.
    Additional variants are possible if it proves useful to flag some other piece of the structure as unreadable, e.g.:
    $para$ = illegible paragraph
    $line$ = illegible line of verse or prose

    Unknown symbols or characters, if they can be distinguished from illegible characters, should preferably be recorded as “#”.

    The illegibility threshold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) “creative” capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.

This is primarily due to expedience: in order to capture lots of text quickly, transcribers were encouraged to give the most attention to Roman symbols, though glyphs like alchemical symbols are important for being able to read and understand the texts in question and are therefore retained.[1] Print symbols such as pilcrows are also commonly included in the transcriptions, as again they are useful for navigating the text. However, as Chris Powell and Paul Schaffner very kindly confirmed for me,

The original instruction came to be modified in the course of actual practice. For one thing, majuscule Greek posed nothing like the same difficulties for capture and review as ligatured early modern Greek type did; in fact, it was difficult to think of a rule that would prevent the keyers from capturing it. So we let them do so, and attempted only to correct their tendency to capture Greek “A” as if it were Latin “A” etc., and so on for the other ambiguous glyphs. Lower-case, ligatured Greek we mostly left uncaptured, unless it served some structural purpose, e.g. was part of a title or chapter heading — or unless the TCP editor felt a rare and inexplicable impulse to type it in.

The end result of all this is that much, perhaps most, of the upper-case Greek has been captured, usually correctly, but that the vast majority of lower-case Greek has not been. (and a little bit of Hebrew, again when it served a structural purpose, and was essential for navigation through the book.) Math only when it could readily be represented as running text, or as a TEI/HTML table. Music only when it consisted of individual notes or symbols, no full tablature.

The implication here is that the final TCP texts are Unicode compliant, but only some of the characters from the full Unicode set make it into the TCP transcriptions. So Greek block letters (Γ, Σ, Φ, Ω) are sometimes transcribed, but the corresponding lowercase ligature symbols (γ, ς, φ, ω) are not.[2] Hebrew characters (א,ק,ש,מ) are very, very rarely transcribed, but Hebrew transliterated into Latinate characters will have been transcribed. Arabic is also notably absent, even though it is available in Unicode and is visible in several EEBO texts.
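One practical consequence is that untranscribed non-English print is at least visible in the XML as <GAP> tags, so it can be counted. A sketch, assuming a folder of TCP XML files; the folder name and the exact attribute casing are assumptions to check against your own files.

```python
import re
from pathlib import Path

# Count <GAP DESC="foreign"> tags per file: a rough signal for texts
# containing substantial untranscribed non-English print.
gap = re.compile(r'<GAP[^>]*DESC="foreign"', re.IGNORECASE)

for path in Path("tcp_xml").glob("*.xml"):   # hypothetical folder
    n = len(gap.findall(path.read_text(encoding="utf-8")))
    if n:
        print(path.stem, n)
```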

Using the JISC historical books (UK subscribing institutions) interface, a user is able to look at the TCP transcripts alongside the images from EEBO, making it possible to compare the print to the transcriptions. Here are some examples of what I mean when I say ‘untranscribed’.

This trilingual page from TCPID A90307 shows text in Latin, Greek script, and Hebrew:
[Image: the trilingual page in A90307]

As we can see, the Greek and Hebrew print in this text are rendered as […], even though there is printed material there. A second example, a few pages later in the same book, is even more illustrative of this issue: the Arabic page shown below is listed as unavailable, even though there is definitely print there, and the next page, in Latin, is transcribed.

[Image: the untranscribed Arabic page in A90307]

Even though they are available in the Unicode character set, these scripts are erased completely from the Text Creation Partnerships’ texts.

I now want to mention some notable exceptions. As I’ve stressed, very little of the Hebrew is transcribed unless it’s been transliterated into Latinate characters, such as in TCPID A37959, a three-column book which offers translations between Latin, Hebrew and Welsh:

[Image: transliterated Hebrew in A57259]

Syriac and related languages rendered in Latinate characters are also transcribed.

And as Paul and Chris suggest, block Greek characters can be transcribed, as this page from TCPID A57729 shows:
[Image: Greek block capitals transcribed in A57729]
but, as the TCP guidelines recommend, the Greek ligatures further down the page are not transcribed and are instead replaced with […]:
[Image: Greek ligatures untranscribed in A57729]

(The transcriptions do not retain the structure of the printed object!)

There’s also at least one fake alphabet in EEBO which is definitely not transcribed as there are no corresponding Unicode characters. But much of that book (TCPID A57259) is left untranscribed – though to its credit, […] is used for Arabic as opposed to pretending it’s not there at all. When present, Greek block characters in this book get transcribed too. It even contains this hyper-detailed transcription of a stylised alphabet, suggesting the definition of a Roman letter can be quite flexible:
[Screenshot: a stylised alphabet, transcribed in detail]

(Decorated initials are also recorded, in case you were curious.)

Of the languages discussed so far, Greek is by far the most common. Lots of evidence of historical printed Greek has been removed from the TCP corpus as a result of these rules. Although I haven’t yet seen a book printed entirely in Greek ligature, it wouldn’t be out of the question; plenty of texts I did look at contained lines, passages, paragraphs, or columns in ligature which are untranscribed and therefore eliminated from the corpus. Arabic and Hebrew are far less frequently found in EEBO, but it’s good to know they’re there. There may be other missing languages I don’t know about, because they don’t appear in multilingual texts which are at least partially transcribed.

For our purposes, this is not necessarily a bad thing. VEP privileges English-language printed material in our pipeline, as our resources are designed to provide high-accuracy ways of visualising, exploring and understanding English-language print from the TCP texts. At least 99% of Early English print is available in Roman characters, and even including Latin, we have a very high level of accuracy and coverage of the TCP data.

Individually, non-Roman glyphs may not represent massive amounts of early printed material, but in aggregate I’d estimate non-English print is going to represent maybe 1% of the entire EEBO-TCP corpus. This is still not a huge – or even meaningful – number, unless you study these languages, in which case it will be a big loss to you! But we will be releasing all the multi-lingual texts in the TCP collection soon, including some notes on languages that do not make it into the transcriptions, meaning you will soon be able to conduct your own investigations on these multi-lingual documents.

+++++++

[1] A small number of alchemical symbols are also available as emoji these days, and confusingly can render as emoji in the transcriptions. They are: ♈ (Aries), ♉ (Taurus), ♊ (Gemini), ♍ (Virgo), ♎ (Libra), ♏ (Scorpio), ♐ (Sagittarius), ♑ (Capricorn), ♒ (Aquarius), ♓ (Pisces), ♌ (Leo). A full list of Unicode-compliant alchemical symbols is available from https://en.wikipedia.org/wiki/Alchemical_symbol#Unicode

[2] For more on lowercase ligature characters in Greek, see the description of Aldine-style characters by Jane Raisch for the JHI: https://jhiblog.org/2016/08/29/greek-to-me-the-hellenism-of-early-print. This essay is an interesting discussion of Greek in Early Modern print more generally, and worth a read if you are interested in the how/why/what of Greek language printing.

Presenting at Yale’s Digital Humanities Lab

VEP’s own Heather Froehlich recently presented at the Yale Digital Humanities Lab! Here is what Heather has to say about her presentations:

On the kind invitation of Cathy DeRose, an alumna of Visualising English Print, I was a visitor at the Yale Digital Humanities Lab last week. While there I gave two presentations: a paper about some of my research involving EEBO-TCP, and a three-hour masterclass on ways of using and accessing EEBO-TCP Phase I. It was a real pleasure to spend a few days with the very keen members of the digital humanities community at Yale.

In the workshop, we primarily discussed what makes EEBO-TCP’s many entry points different from the Early English Books Online images, in addition to best practices for accessing and using EEBO-TCP. My main goal was to highlight the fact that sure, you can download all the texts yourself, clean them up yourself, and then start the research process… or you can take advantage of a lot of hard work that others have done and start conducting your research without the stress of doing it all yourself. First I introduced the difference between the TCP transcriptions and the Chadwyck-Healey images, familiarising participants with the online repository of EEBO-TCP transcriptions and exploring their relationship to the searchable features users are already used to interacting with on the Chadwyck-Healey search interface.

We also discussed and practiced using several front ends, including the CQPweb interface for EEBO-TCP and the BYU corpora’s incomplete version to identify potential variant spellings, as well as the Early Print Ngram viewer for EEBO-TCP Phase I to trace the most frequently used variants and concepts. We also discussed the benefits of using historical resources such as the Historical Thesaurus of the Oxford English Dictionary; all of these are approaches that I and other members of Visualising English Print have used in our research.

Finally, to tie it all together, I also presented several case studies based on work done by my colleagues at the University of Strathclyde. The Super Science Corpus, something Strathclyde RA Alan Hogarth has been working on for the better part of a year, represents the world of Early Modern scientific writing included in the Phase I release of the EEBO-TCP texts. With his help I was able to give some preliminary results about the relationship between philosophy of science and other scientific writing between 1482 and 1710. We also discussed work by Shota Kikuchi, a visiting scholar from the University of Tokyo at Strathclyde, which seeks to improve part-of-speech tagging accuracy by further modernising archaic constructions after the initial VARD process outlined here. For example, by modernising tis to it is, a part-of-speech tagger’s accuracy improves enough to make syntactic analysis more viable. And, last but certainly not least, I spoke about undergraduate student Rebecca Russell’s work with Jonathan Hope in the interdisciplinary Textlab course on the language of Shakespeare’s plays, showing the potential for students to use these kinds of resources in a pedagogical context.

View presentation slides for Heather’s talk, ‘Things You Can Do with EEBO-TCP Phase I,’ and a supplementary handout.

View Heather’s post about her time at the Yale Digital Humanities Lab.

Shakespeare Association of America Annual Meeting 2016

Visualizing English Print is at the Shakespeare Association of America Annual Meeting in New Orleans! Jonathan Hope, Alan Hogarth, and I will be part of the digital exhibits on Thursday from 10:00 AM to 1:30 PM. Be sure to stop by! We’ll be offering demonstrations of new tools like TextDNA and of our early modern drama and early modern science corpora.