Blog

Guest Post: Data-Mining King Lear

[I am pleased to offer this guest post by Darby Foster, a first year undergraduate student at Georgia Institute of Technology, majoring in Business Administration/Information Technology Management. Her professor, Dr Sarah Higanbotham, was kind enough to get in touch with me to share Darby’s final paper, which appears in a truncated form here. VEP loves hearing from students whose imaginations have been really taken by the work we do. –hgf]

Darby Foster
Georgia Institute of Technology

First Folio, Emory University, Nov. 2016

As a Business Administration/IT Management major, I was not overly eager to take an English Literature course, and especially not a Shakespeare course focusing on the 1623 First Folio. And yet I have never been (and perhaps will never be again) so excited about research as I was when I applied data mining to Shakespeare's late tragedy, King Lear. It began with Michael Witmore's podcast on data-mining Shakespeare, which inspired me to experiment with data-mining myself: first with Hamlet, using the online corpus analysis software Voyant to isolate word trends in Hamlet's soliloquies. In particular, I traced relative word frequencies and found a predominance of comparisons (16 uses of the preposition "like"). When I then read King Lear, I became even more curious about that play's language. The corpus analysis software Ubiqu+ity allowed me to analyze King Lear quantitatively in terms of its tragedy, trying to gain perspective on just how sad the play really is. My analysis provided substantial evidence against the claims of the literary critic George Steiner about Shakespeare and the genre of tragedy.

Most people define genre by overall narrative structure. To a traditional close reader, genre is "a type of literary work characterized by a particular form, style, or purpose" ("Genre"). But to a computer, "genre is a coordinated set of having things and not having things" (Witmore 2011). Data-mining software takes texts or selections of text and counts the occurrences of specific words and phrases. Certain words play a key role in tragic drama in particular, including doubt, sense, nature, and fortune (Booth 1983, 37). DocuScope's dictionary categorizes thousands of words into groups such as "Positivity," "Negativity," "Anger," and "Sad." By tracking the individual words in each category, I found it surprisingly easy to discover a play's genre.

Hope and Witmore 2004, 2010

Shakespeare's 1623 First Folio divides the plays according to genre: comedies, histories, and tragedies. While the compilers of this collection likely used plot to separate the plays into genres, the same separation can be done using data-mining (Witmore 2011). At this level of analysis, however, the genre of tragedy can be difficult to pin down: data-mining software can easily distinguish comedies from histories, but tragedies lie somewhere in between these two genres (Hope and Witmore 2004). DocuScope, a sophisticated data-mining tool, counts the occurrences of specific categories of words and phrases in sections of text and creates graphs to display the findings visually. The following graph is a scatterplot of 1,000-word pieces of all of Shakespeare's plays, color-coded by genre (Witmore 2011): green dots represent histories, red dots comedies, orange dots tragedies, and blue dots the late plays. The graph shows that what histories have, comedies lack, and vice versa, while tragedies sit in the middle of these two more defined genres. The patterns in the graph suggest that, in addition to sharing plot structures and characters, Shakespeare's plays within the same genre share language and style.

One of Shakespeare’s most famous tragedies, King Lear, produces fascinating results when data-mined. DocuScope breaks up the text into over 100 categories of words. Each category contains thousands of words that were individually selected by David Kaufer, an English professor at Carnegie Mellon University, to fit a specific idea. One of the most prominent categories in the text of King Lear is “Negativity.” This category contains words such as death, curse, and torturous and corresponds to a total of 798 individual instances of negativity throughout the play (Ishizaki and Kaufer 2012). Such a strong presence of a single emotion greatly influences a work of literature. In this case, it also plays a big role in determining the genre of the play. Data-mining this play clearly reveals the play’s tragic nature.

Anyone who experiences King Lear can likewise tell that the play is a tragedy. From act one, scene one, it is evident that things are going downhill, as the king reveals his “darker purpose” to divide his kingdom into three parts, one for each of his daughters, so they can rule while he takes an “unburdened crawl toward death” (Shakespeare 1997, 1.1.43). From this point forward, the play is filled with pessimism, tragic events, and nihilism. Some argue that the decision to divide the kingdom is the true climax of the story, breaking the mold of traditional Shakespearean tragedies (Bowers 1980, 13). This structure allows no time for introducing the classic narrative fall of Lear; it brings the audience right into the middle of the story, which quickly becomes tragic. The two most loving and loyal characters in the play, Cordelia and Kent, are quickly banished. Not long after, Lear himself is banished from the homes of his daughters and sent out into a terrible storm (Shakespeare 1997, 2.4.295-353). The play becomes less tolerable to the audience as Lear’s mental capacity deteriorates. Rather than the tragedy building slowly over five acts, the audience experiences King Lear’s fall from 1.1. As the play progresses, there is still hope that conflict will be resolved and the protagonist will live on, but Shakespeare refuses to fulfil the desires of his audience (Booth 1983, 17). Cordelia’s death shocks everyone. “Enter Lear, with Cordelia in his arms, and the most terrifying five minutes in literature have begun” (Booth 1983, 11). The play ends, not with poetic justice, but with a father carrying the body of the virtuous young daughter whom he misjudged. And to intensify the tragedy, Lear himself dies just minutes later.

A quantitative perspective on King Lear yields similar results. Graphing the relative frequencies of specific types of language reveals patterns in the data. An interesting example is "Positivity," which contains words and phrases such as trust, blessing, and hope, as in "I pray you, sir, take patience: I have hope" (Shakespeare 1997, 2.4.130). As the play progresses, overall levels of negativity decrease, but so do levels of positivity, which are almost always lower than the levels of negativity.

In the graph above, "Negativity" is represented in red and "Positivity" in blue, over time. The diminishing positivity can be attributed to the nature of tragedy: as more and more tragic events occur, the scenes and characters are filled with less positivity. This increasing level of tragedy correlates with a steadily increasing level of overall sadness. While there are peaks and troughs in the graph of words categorized as "Sad," the linear regression line shows an overall increase in sadness as the play goes on. This reflects the emotions of the characters as well as the mood imposed on the audience during the tragedy. Language categorized as "Anger" follows a similar pattern, increasing relatively as the play progresses. In the overlay of the two graphs, with the DocuScope categories "Anger" in red and "Sad" in blue, the major peaks in both categories even somewhat align. These two emotions, anger and sadness, are clearly correlated in this play. Both are typically thought of as negative emotions, which are common in tragedies. When tragic events occur, natural responses often include sadness over what happened and anger that it happened at all. In King Lear, characters often experience anger, sadness, or both in response to the events of their lives.
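
For readers who want to try something similar without DocuScope itself, here is a minimal Python sketch (not the method used in this essay) that computes the relative frequency of a hand-picked word list, standing in for a DocuScope category, across consecutive 1,000-word chunks of a play; the word list and file path are illustrative only.

import re

# A tiny hand-picked stand-in for a DocuScope-style "Sad" category (illustrative only).
SAD_WORDS = {"grief", "weep", "tears", "sorrow", "mourn"}

def chunk_frequencies(text, words, chunk_size=1000):
    """Relative frequency of `words` in consecutive chunks of `chunk_size` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    freqs = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            freqs.append(sum(t in words for t in chunk) / len(chunk))
    return freqs

# Plot these values against chunk index to see the category rise or fall over the play.
with open("king_lear.txt", encoding="utf-8") as f:   # file path is hypothetical
    print(chunk_frequencies(f.read(), SAD_WORDS))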

Lear is Shakespeare's most tragic play. It is possibly even "the most devastating tragic apprehension in the whole of Western dramatic literature" (Jackson 1966, 26). As Stephen Booth summarizes, "watching Lear is not unlike waiting for the death of a dying friend; our eagerness for the end makes the friend no less dear" (Booth 1983, 17). This very specific feeling captures the experience of King Lear; it is so depressingly tragic that all the audience wants is for the misery of the play to end. This kind of incredibly sad tragedy has its own name: absolute tragedy. Absolute tragedy "is immune to hope" (Steiner 2004, 4). It leaves no opportunity for the audience to believe that something good will come from all the negativity; it is unquestionably tragic. Such absolute tragedy "presents men and women who the gods torture and kill 'for their sport'" (Steiner 2004, 11). This action is directly referenced in King Lear, when Gloucester recognizes late in the play, "As flies to wanton boys are we to th' gods. They kill us for their sport" (Shakespeare 1997, 4.1.41-42). By this definition, King Lear aligns seamlessly with absolute tragedy.

Steiner disagrees. According to him, Shakespeare’s only absolute, and therefore most tragic, tragedy is Timon of Athens (Steiner 2004, 12). He argues that Timon’s utterly bleak plot and motifs make this play more tragic than the rest. A scan through DocuScope provides contrary results. In categories that are critical to the genre of tragedy, King Lear dominates. The chart on the right shows the percentage of each play that fits into the DocuScope categories of “Negativity,” “Positivity,” “Anger,” and “Sad.” These values show that King Lear is approximately 1.09 times more negative, 1.59 times sadder, and 1.02 times angrier than Timon of Athens, which also happens to be 1.08 times more positive than King Lear. Based on these metrics, King Lear clearly contains higher concentrations of words that are typically found in tragedies. This quantitative analysis provides a more precise technique for determining absolute tragedy, revealing that Lear is not only an absolute tragedy, but even more tragic than Timon of Athens.

Works Cited
Booth, Stephen. (1983). King Lear, Macbeth, Indefinition, and Tragedy. New Haven: Yale University Press.

Bowers, Fredson. (1980). “The Structure of King Lear.” Shakespeare Quarterly 31 (1): 7-20.

“Genre, N.” (2014) OED Online. Oxford University Press. Accessed February 7, 2017. http://www.oed.com/view/Entry/77629.

Hope, Jonathan and Michael Witmore. (2010). “The Hundredth Psalm to the Tune of ‘Green Sleeves’: Digital Approaches to Shakespeare’s Language of Genre.” Shakespeare Quarterly 61 (3): 357-90.

Hope, Jonathan, and Michael Witmore. (2004). “The Very Large Textual Object: A Prosthetic Reading of Shakespeare.” Early Modern Literary Studies 9 (12). Available online: purl.oclc.org/emls/09-3/hopewhit.htm.

Ishizaki, Suguru and David Kaufer. DocuScope Dictionary. Created 2012. Accessed 7 November 2016. Available online: github.com/docuscope/DocuScope-Dictionary-June-26-2012.

Jackson, Ester Merle. (1966). “King Lear: The Grammar of Tragedy.” Shakespeare Quarterly 17 (1): 25-40.

Shakespeare, William. (1997). King Lear. Ed. R.A. Foakes. London: Arden Shakespeare. Available online: http://shakespeare.mit.edu/lear/.

Steiner, George. (2004). “’Tragedy,’ Reconsidered.” New Literary History 35 (1): 1-15.

Witmore, Michael. Data-Mining Shakespeare. Created 2011. Accessed 7 September 2016. Available online: https://youtu.be/W1RsgUqFEeY.

Using the metadata builder to guide an analysis

As we've been releasing new resources for interacting with the TCP files, one of the questions that keeps coming up is "This is great, but what are we supposed to do with this stuff?" In this blog post I'm going to show how you can use the Core 1660 Drama corpus (from our Early Modern Drama collection) and the Metadata Builder to look at plays which didn't explicitly involve Shakespeare as an author. I also wanted a set of plays that covered a range of genre classifications (by any measure of genre) and was a manageable size.

Using the Metadata Builder, I have the option to collect a variety of metadata from the master spreadsheet associated with the Core Drama corpus. As I want to study the texts freely available as part of the TCP, I select the ‘Unrestricted’ option in Step 1 rather than ‘All’. In this particular case, I am interested in play companies, so I want to ensure I get metadata which will supplement and guide my analysis of play companies. Therefore, I select the following categories in Step 2: TCP, ESTC, Wiggins Number, Author 1, Authors 2-5, Title, Genre, Wiggins Genre, DEEP Genre, Harbage Genre, Wiggins Contemporary Genre, Date of Writing, Date of first performance, Play Company 1, Play Company 2, and Theatre.[1] I could have downloaded more metadata, but these categories seemed most suited to guide an analysis of one particular play company. Looking at the metadata spreadsheet and paying specific attention to the Play Company 1 category, I settled on the Admiral’s Men, as it is inclusive of a diverse range of authors (including Munday, Dekker, Marlowe, Chapman and Peele) while remaining a manageable size (21 plays).

I then isolated the specific TCPIDs associated with each play-text belonging to the group I will now call ‘Admiral’s Men Plays’. Armed with this list, I copied these plays into a new folder to create a subcorpus of plays from the Core 1660 Drama Corpus. Here’s what that looked like:

[Screenshot: the new 'Admiral's Men Plays' subcorpus folder]
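
If you'd rather not copy the files by hand, here's a minimal Python sketch of the same step; the folder names and TCPIDs are placeholders for whatever your own metadata spreadsheet gives you.

import shutil
from pathlib import Path

# TCPIDs for the 'Admiral's Men Plays' subcorpus (illustrative values only).
admirals_men_ids = ["A01234", "A05678", "A09999"]

source = Path("Core_Drama_1660")        # hypothetical folder holding the full corpus
target = Path("Admirals_Men_Plays")     # the new subcorpus folder
target.mkdir(exist_ok=True)

for tcp_id in admirals_men_ids:
    for f in source.glob(f"{tcp_id}*.txt"):   # plain-text files for that TCPID
        shutil.copy(f, target / f.name)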

Having made decisions about what texts to analyse and moved the files around to create a corpus of Admiral's Men Plays, I can now set up a multivariate linguistic analysis using Ubiqu+ity to observe some specific linguistic features. I've previously written about creating your own rules for Ubiqu+ity, but this time I want to use the standard DocuScope dictionary, which is a rich classification schema of the English language. While I may not necessarily agree with every decision made in the DocuScope categorization of the English language, it applies the same rules to every text it is given to analyze, which means that it counts the same features every time. Using the default settings on the Ubiqu+ity site, the system sends me a zipped folder of results. Included in this zipped folder is a comma-separated values spreadsheet which reports, for each file, what percentage of the text falls into each category.

A selection of linguistic categories reported by the DocuScope dictionary

Due to the nature of how DocuScope categorises language, some linguistic groupings are more likely to be in use than others. For example, FirstPerson (I, me, etc.) is more frequent than Apology (sorry, apologies, etc.) because of how language is distributed: small, boring words like I and me are far more frequent than more contentful words like 'sorry' or 'apologies'. (This is part of a phenomenon called Zipf's Law, and you can read more about it here.) You may also have noticed that the filenames use the anonymized TCPID numbers; you can cross-reference them against titles using the metadata spreadsheet.

With this report, it is also possible to conduct a variety of quantitative analyses. Our colleagues have projected all of these categories into multidimensional space using Principal Component Analysis, but it is sometimes easier to focus on just a few features. By limiting the spreadsheet to only a handful of Language Action Types, it becomes far more manageable to work with. If I were interested in the category 'Sad', I could rank the spreadsheet using Excel's sort function and see that Two Lamentable Tragedies has the highest quantity of 'sad' in the subcorpus I have constructed. I can then see what other features are highly ranked for Two Lamentable Tragedies, or I can use the SlimTV TextViewer included in the downloaded Ubiqu+ity folder to identify other high-frequency linguistic categories for this play. With the SlimTV viewer, I can see that 'negativity', 'intensity', and 'standards-positive' are all highly ranked:

Negativity, Standards(Positive) and Intensity are all highly ranked in Two Lamentable Tragedies
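
If you prefer to do the ranking outside Excel, a rough pandas sketch looks like this; the filename and the exact column names ('Sad' and so on) are assumptions about your particular results file.

import pandas as pd

# Load the Ubiqu+ity results spreadsheet (filename is an assumption).
results = pd.read_csv("ubiquity_results.csv")

# Rank the subcorpus by the 'Sad' category, highest first; the first column of the
# spreadsheet identifies each play by its TCPID-based filename.
saddest = results.sort_values("Sad", ascending=False)
print(saddest.head(10))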

And that's just a jumping-off point: what other plays share this linguistic profile? From here I could compare how these specific features are used in other plays performed by the Admiral's Men. But that's still too broad a research question, so here are some more specific ones to follow up on: Do other plays in this subcorpus also rank highly in those features? How do they compare to the majority of the plays in the Core Drama 1660 corpus? Does the prevalence of negativity and intensity correlate with the acting style or the material this company chose to perform?


[1] The metadata we have comes from several sources, including the Database of Early English Playbooks (DEEP) and Wiggins catalogues. We have also performed cross-checking between these two resources as well as including further reference to the ESTC and JISC Historic Texts, where necessary.

(this information has been taken in part from the VEP Core 1660 readme file [pdf])

TCP – lists the associated unique TCP ID number for the play in question

ESTC – Short Title Catalogue records, in case I need/want to find out more information about these texts

Wiggins Number – in case I want/need to reference the Wiggins catalogue for a particular play; based on Wiggins catalogues published to date

Author 1 – Primary author

Authors 2-5 – any other assigned authors, where applicable

Title – the title by which the play is commonly known – often the contemporary title.

Alternative Title – any other names the play could be known as. For example, the play known as “Volpone” is listed as “Volpone” for primary title, and “Volpone, or The Fox” is considered the secondary title. (We also use the ‘secondary title’ category to describe printed titles when they are different to performed titles.)

Genre – these are the genres originally assigned earlier in the project by Jonathan Hope – either Tragedy [TR], Tragicomedy [TC], Comedy [CO], History [HI], Masque [MA], Interlude [IN], Entertainment [EN], Dialogue [DI], or Non-Dramatic [ND]

Wiggins Genre – based on information from the Wiggins Catalogues (published to date)

DEEP genre – based on information from the Database of Early English Playbooks [link]

Harbage genre – from Harbage’s Annals of English Drama (1989)

Wiggins Contemporary Genre – genre classifications based on contemporary (modern) understandings of genre, taken from the Wiggins catalogues (based on those published thus far)

Date of writing – all texts have been given a date of writing. DEEP doesn’t have a date of writing column but sometimes offers a date range under ‘date of first production’, so in these instances the earliest date was taken for date of writing – if Wiggins offers a fixed date for date of writing then this was taken.

Date of first performance – when the play is understood to be first performed, if known

Play company 1 – The company of first production according to DEEP

Play company 2 – DEEP’s company attribution (where applicable)

Theatre – theatre and/or location of production, where available


Using the Metadata Builder: Getting the information that you want

Yesterday, Deidre wrote about the release of our new Metadata Builder, which collates lots of available information about materials included in the Text Creation Partnership transcriptions in one place. For each corpus available, you have the option of downloading metadata only for texts freely available in the public domain, or metadata for texts both freely available and presently restricted, to be made available in the public domain in 2020 (we can distribute information about these restricted-access texts, but we can't share the files). As a user of the Metadata Builder, I want to be able to take advantage of all the different metadata options available to supplement and guide my analyses. In this post I'll walk you through a few ways of obtaining different kinds of information using the various options on offer.

I happen to be interested in the language of dramatic writing. Visualizing English Print offers three different dramatic corpora: the Core Drama 1660 corpus, the Expanded Drama 1660 corpus, and the Expanded Drama 1700 Corpus. (Many scholars of Early Modern drama will be familiar with the Database of Early English Playbooks or DEEP as it is commonly known; this is in some ways quite similar). Of our dramatic corpora, the Expanded Drama 1700 corpus (ED1700) covers the largest quantity of dramatic writing, so I’ll use it as an example.

If I want all the metadata available for this corpus regardless of public-domain status, I would select 'All' available texts in Step 1. However, if I want to use this metadata to guide decisions about a project, I might prefer to use the 'Unrestricted' version of the corpus, as these texts are all freely available for download from our site.

First things first: to get all of our available metadata for either version of ED1700 specified in Step 1, select 'all' under every drop-down menu in Step 2. This is the "all you can eat" option: it will include every piece of metadata we have available, and from there you can download the spreadsheet and its associated readme file in Steps 4 and 5. With everything downloaded, you can always further refine your spreadsheet, but I find it useful to keep one master spreadsheet pristine and do metadata manipulations, such as organising by author, date, or other parameters, in a second version of the original spreadsheet.

While it is great to have everything, sometimes it can be overwhelming. This post is therefore not meant to be a how-to guide but more of a 'ways of thinking about the Metadata Builder' guide. Here are a few of the metadata columns we offer which I personally find most useful.

If you want the dedicated TCP ID number associated with each transcription, select the category 'TCP' from the dropdown menu "Master Metadata". These unique TCP identifiers match a specific transcription: the TCP identification number A01234 will always link to that specific document. ESTC data (including Wing numbers, where applicable) is available under the option 'ESTC'.

Under Master Metadata, we also offer information from the Wiggins Catalogues of British Drama, including their identification number schema and historical and contemporary genre assignments. Other useful genre information includes the DEEP genre and the Harbage genre, should you want to compare different understandings of genre over time or across different criteria. I also often want to know how many words are in each text, as this is a common way of describing how big or long a text is. This can be found under the Ubiq categories: select '# word tokens' at the very bottom of the Ubiq dropdown menu in Step 2.

The ability to group plays by company, based on information from the Wiggins Catalogues (using the options for Play Company 1 and Play Company 2 under Master Metadata in Step 2), means that I can easily organise an analysis using attributed information about the working theatrical networks of the time and ask what makes the language of plays put on by the King's Men different from, say, all other companies. With the option of including up to five authors as well, I can start to make more complex analyses along multiple axes, such as asking only about single-authored plays performed by Queen Henrietta Maria's Men that are over 20,000 words long.
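
As a rough illustration of that kind of multi-criteria selection, here is a pandas sketch; the column names ('Play Company 1', 'Author 2', '# word tokens') are guesses at how they appear in the downloaded spreadsheet, so check yours first.

import pandas as pd

meta = pd.read_csv("ED1700_metadata.csv")   # the spreadsheet built in Steps 1-4 (hypothetical name)

# Single-authored plays performed by Queen Henrietta Maria's Men, over 20,000 words long.
subset = meta[
    (meta["Play Company 1"] == "Queen Henrietta Maria's Men")
    & (meta["Author 2"].isna())              # no second author listed
    & (meta["# word tokens"] > 20000)
]
print(subset[["TCP", "Title", "# word tokens"]])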

What I am doing here is not limiting my corpus based on arbitrary features, but selecting texts which fit certain parameters to get at more specific questions. The more features I pull in, the more information I can base my decisions on, but not all of the categories in Step 2 may be immediately useful. For example, I probably don't need to know whether there are figures (images) in these texts or how many pages long the texts originally were: that's probably not going to help me. By excluding them from my spreadsheet, I am able to focus on more relevant information (and if I decide I do want them later, I can always get them from the all-metadata spreadsheet I downloaded in the first instance).

Another thing I can do with the Metadata Builder is download DocuScope tagging statistics for every text in a specified corpus using the dropdown menu 'Ubiq' in Step 2. This means that I do not have to process the ED1700 Unrestricted corpus through Ubiqu+ity myself, but can instead combine multiple metadata categories alongside the statistical distributions produced by the DocuScope text-tagging schema. By selecting relevant metadata categories such as author(s), date of first performance, theatrical group, and several views of genre assignment, I am setting myself up for quite a nuanced multivariate analysis using these particular features.
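
As one small example of what that combined spreadsheet makes possible, the sketch below averages a single DocuScope category by genre; again, the filename and column names are assumptions rather than guaranteed labels.

import pandas as pd

combined = pd.read_csv("ED1700_with_ubiq.csv")   # metadata plus Ubiq columns (hypothetical name)

# Mean 'Negativity' score per DEEP genre: one slice of a multivariate analysis.
print(combined.groupby("DEEP Genre")["Negativity"].mean().sort_values(ascending=False))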

Finally, the multivariate analyses I suggested above do not necessarily require the use of further computational methods. The ability to isolate all the texts that fit a certain principle can guide any number of decisions for studies which rely on close reading, such as identifying transcriptions which have multilingual content, or discovering a text you didn't know about that has a clear connection to your previous work. The Metadata Builder therefore makes it possible to obtain a great deal of information about the huge number of texts now available as a result of the TCP project. We look forward to seeing what you will do with it!

VEP’s Metadata Builder

VEP’s Metadata Builder helps users navigate our corpora collections, which are vast in scale. The Metadata Builder provides metadata to users in an accessible, intelligible format, preventing users from having to decipher spreadsheets with more than 120 columns and 160,000 rows. This blog post will explain the motivations for creating the Metadata Builder and its main components, enabling users to understand how to use the tool.

Motivations

This tool was built to allow users to generate and download metadata spreadsheets of VEP corpora tailored to their own research interests. Users can merge metadata from multiple spreadsheets into one, filter their results, and engage with the texts. The Metadata Builder provides options for downloading both the metadata and README documentation generated for tailor-made spreadsheets.

The Metadata Builder’s functionality is guided by data management best practices, which have vast implications for scholarly research. The builder ensures that users have access to the most up-to-date information.

Main Components

The Metadata Builder has five sequential steps:

  1. Pick the Documents
  2. Pick Metadata Fields
  3. Examine the Metadata
  4. Save the Metadata
  5. Save the README

1. PICK THE DOCUMENTS
The first step requires users to select the spreadsheet based on the VEP corpus they are interested in working with. The corpus spreadsheets contain information about corpus texts and files. Hovering over a corpus in the dropdown menu activates a tooltip displaying a brief corpus description. This description is replicated in the Step 1 box once a corpus is selected.

MB_step1

For each corpus, VEP offers two versions of metadata spreadsheets: 'Unrestricted Only' and 'All'. To explain the difference: licensing agreements prevent VEP from releasing corpus text files made from restricted content in our source files, though we can share metadata about restricted files for research purposes. VEP therefore gives users the option to download a spreadsheet of only the free files in a corpus (Unrestricted Only) or a spreadsheet of both free and restricted files (All). Unrestricted Only spreadsheets catalog the files available in our corpus downloads.

2. PICK THE METADATA FIELDS
This step of the Metadata Builder can be overwhelming for users who are not familiar with the data we provide. It is also the fun part, though, where users select the specific information they are interested in.

To understand this step, users need to know that the Metadata Builder pulls information from multiple existing spreadsheets into a single spreadsheet specified by the user. From these spreadsheets, users specify the exact columns they want dynamically generated as their dataset. Metadata spreadsheets are listed in bold print. Beside each metadata spreadsheet name is a question mark; hovering over it will display a brief description of the spreadsheet's contents. Users can also view a spreadsheet by clicking on the source link to the right of the question mark, and the spreadsheet will appear in a new window. Dropdown menus to the right of the metadata spreadsheet list contain the names of all available columns from each linked spreadsheet, and hovering over the column names will display explanatory tooltips.

MB_step2

Below is a list that explains the metadata type spreadsheets you may see in step two:

  • Master Metadata: This spreadsheet contains the corpus metadata provided by the curator, from text name and author to genre and number of pages in the text.
  • Text Links: This spreadsheet contains links to files of unrestricted corpus content hosted on the VEP server. You can click on the links to read corpus texts.
  • Ubiq Categories: This spreadsheet contains DocuScope LAT information for the files in the selected corpus.
  • TCP Metadata: This spreadsheet contains metadata for all TCP digital texts, provided by the TCP.
  • Non-English Language Metadata: This spreadsheet lists primary and secondary languages for texts in the TCP that have the least amount of recognizable English in them.
  • Figures-Per-Text Metadata: This spreadsheet lists the number of FIGURE XML tags that appear in each TCP XML text. It is useful for finding texts that contain tables and images.
  • Derived Date Metadata: This spreadsheet provides a programmatically selected date for every text in the TCP.

Users can select columns from as few as one metadata spreadsheet or from as many as all of them. Once users select metadata columns, a 'Build Metadata' button appears in the section. Pressing the button generates your data table in the section for step three.

3. EXAMINE AND CONFIRM THE METADATA
This section renders a dataset based on the specifications entered in section two. It allows users to sort and search the metadata.

MB_step3

4. SAVE THE METADATA
This section allows users to save the generated datasets in a variety of formats. Users can save their datasets as Excel spreadsheets or CSVs. Users can optionally copy all of the information to their clipboard or print it.

MB_step4

5. SAVE THE README
This section contains a red button labeled 'Download README.' This step is important: the Metadata Builder dynamically generates a README file that explains the contents of your dataset.

MB_step5

Stay tuned for an upcoming blog post by Heather that walks users through what she finds to be the Metadata Builder’s most useful features!

VEP Releases

VEP has been busy improving its visualization tools and processing pipeline! You can read about all the changes in the list below.

Release Information

  • Text Processing Pipeline 2.0 features better Unicode handling during character cleaning and a dictionary that standardizes spelling variation across TCP corpora. Read about the pipeline on the ‘Workflow’ page. Download the pipeline from GitHub.
  • TextDNA is available for download! The download includes sample datasets and Python scripts for curating your own TextDNA datasets. Download it from GitHub.
  • Ubiqu+Ity 1.2 is officially released! The SlimTV (or Slim TextViewer) replaces Ubiqu+Ity HTML files for navigating tagged text.
  • Updated corpora (processed with pipeline 2.0) are available for download.

Making your own rules for use with Ubiqu+Ity

Several years ago, Michael Witmore and Jonathan Hope published a paper in Shakespeare Quarterly that describes how the string-matching rhetorical analysis software DocuScope is able to identify stylistic fingerprints of genre in Shakespeare’s plays. Visualizing English Print is proud to make the string-matching rules used by DocuScope available online for general use as part of the multivariate textual analysis package Ubiqu+Ity.

Ubiq landing page

The DocuScope dictionaries, which were initially designed to analyze rhetorical features such as persuasiveness or first-person reporting, cover 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects (see http://www.cmu.edu/dietrich/english/research/docuscope.html for more information). Figure 4, taken from Ishizaki and Kaufer (2011), illustrates their process:

Building the DS Dictionaries

According to David Kaufer, the creator of the DocuScope dictionaries, words or phrases which share an 'aboutness' can be grouped together in a hierarchical model of what he describes as Language Action Types (LATs); when someone runs his DocuScope dictionary on any given corpus, the software will search for exact matches based on the classifications he has made and report statistical frequencies for each category. While the DocuScope dictionaries are quite specific – in many ways they represent their creators' view of how language functions – any corpus sent through the dictionary will be analysed in the same way. It doesn't matter if you send all of Charles Dickens' novels, or emails from your mother, or all of Shakespeare's plays through the DocuScope classification schema; the dictionary will check for the exact same features every time. (The joy of DocuScope, and of any string-matching software like it, is that every text uses these terms in a slightly different distributional pattern.)

In other words, Ubiqu+Ity matches text to entries in the dictionaries, then computes the percentage of words per document that falls into each LAT category. Essentially, Ubiqu+Ity parses text and then tells you which categories the language falls under, according to the rules outlined in the DocuScope dictionaries. With Ubiqu+Ity, we offer several versions of Kaufer's dictionaries as well as the ability to create your own rules. What if, for example, you were interested in the language of gender? While the DocuScope dictionaries cover a huge range of rhetorical and linguistic features, they do not have a category explicitly devoted to gender, though terminology related to gender can appear in a variety of existing LATs.

How to specify your own rules

As these instructions suggest, we need to create our new dictionary as a Comma-Separated Values (CSV) sheet in Excel. To the uninitiated, a Comma-Separated Values file is a spreadsheet, but in a specific format: where Excel files end with the suffix ".xlsx" (akin to ".docx", the Word equivalent), CSV files end with the suffix ".csv". It looks like any other spreadsheet in Excel, but it is a non-proprietary format, which means your data will move comfortably across any software program and retain its structure. The example provided above is a tiny bit deceptive, though: when you save a file as a CSV, the commas are inserted as column delimiters for you, so the saved file will look like the example above even though you don't type the commas yourself. If you include any special characters (spaces, punctuation, etc.) in your rules, Ubiqu+Ity will search for that exact match. The table below shows two ways of formatting a set of rules:

GOOD RULE FORMATTING

LESS GOOD RULE FORMATTING

[Two screenshots: the good formatting on the left, the less good formatting on the right]

The one on the left is considered good rule formatting, because the computer will recognize it as

he, masculine
his, masculine
him, masculine
man, masculine
boy, masculine
she, feminine
her, feminine
hers, feminine
woman, feminine
girl, feminine

And the one on the right is considered less good, because to the computer this will read

he,, masculine
his,, masculine
him,, masculine
man,, masculine
boy,, masculine
she,, feminine
her,, feminine
hers,, feminine
woman,, feminine
girl,, feminine

(The one on the right may not necessarily be bad formatting outright, depending on what you’re interested in counting, but it definitely is less good formatting than the one on the left if you just want to count words and not words with punctuation!)
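
If you'd rather build the rules file with Python's csv module than type it in Excel, the delimiter bookkeeping is handled for you; this sketch writes the 'good' version of the gender rules shown above.

import csv

rules = [
    ("he", "masculine"), ("his", "masculine"), ("him", "masculine"),
    ("man", "masculine"), ("boy", "masculine"),
    ("she", "feminine"), ("her", "feminine"), ("hers", "feminine"),
    ("woman", "feminine"), ("girl", "feminine"),
]

# newline='' is the csv module's recommended setting when writing files.
with open("gender_rules.csv", "w", newline="") as f:
    csv.writer(f).writerows(rules)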

These lists can be as long or as short as you want, and they can be as specific or vague as you want: but whatever you tell Ubiqu+Ity to find, it will find. Once you upload your own dictionary, Ubiqu+Ity will use it to analyse your corpus, which is where the real fun starts. Here’s an example I ran using the VEP plain-text version of the Folger Digital Texts Shakespeare corpus (download it from here). You can download a CSV file reporting on the statistics of your user-defined rules, which will look like this:

User-defined CSV rules

This spreadsheet reports what percentage of each text matches each user-defined rule, just as it would with the DocuScope dictionaries. I've used the rules described above as the example; the more categories you define, the larger your spreadsheet will be, of course. From here, you can do the usual Excel things, like graph the results to see how the proportions of 'masculine' and 'feminine' words differ across the plays:

[Chart: percentage of 'masculine' and 'feminine' words in each play]

Looking at this chart, I immediately want to know why Two Gentlemen of Verona has such a comparatively high volume of 'feminine' terms compared to other Shakespeare plays. But computers are also very good at identifying absence in ways that we humans cannot, so I am also interested in seeing why some plays, like 1 Henry 6, Love's Labour's Lost, and A Midsummer Night's Dream, have a smaller proportion of 'masculine' language overall – now I have specific research questions to tackle based on my initial findings.
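
To make a chart like the one above straight from the user-defined rules report, something like this pandas/matplotlib sketch would do; the filename and the 'masculine'/'feminine' column names are assumptions based on the rules defined earlier.

import pandas as pd
import matplotlib.pyplot as plt

# The Ubiqu+Ity report of user-defined rule statistics (filename is an assumption).
report = pd.read_csv("user_defined_rules.csv", index_col=0)

# Plot the per-play percentages for the two user-defined categories.
report[["masculine", "feminine"]].plot(kind="bar", figsize=(14, 5))
plt.ylabel("% of words in play")
plt.tight_layout()
plt.show()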

Editing Programmatically; or, Curating ‘Big Data’ Literature Corpora

No one has time to read and really understand all of the 1,476,894,257 words that comprise the Text Creation Partnership (TCP) digital texts. Considering that adults read on average 300 words a minute, it would take someone about 40 years to read every word in the TCP’s 61,000 texts. That 40-year estimate assumes 52 40-hour work weeks per year—no vacations, no holiday time, no sick time, no lunches or breaks.

How, then, does one editorially intervene in literature datasets at such scale? This task isn’t best carried out by more traditional methods of editing, where a human reader scrutinizes each text word by word and makes local changes. In this post, I will describe the research process I used to create VEP’s dictionary for standardizing the early modern spelling variation captured in TCP texts.

The goal of spelling standardization is to map variant spellings to a standard spelling, like “neuer” and “ne’er” to “never”. This standardization reduces noise in the dataset, providing analytical gains. It makes statistical analysis more accurate for users interested in counting and weighing textual features, like word frequency and part of speech parsing.
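
Mechanically, applying such a mapping is simple; here is a toy sketch (the real VEP dictionary is far larger and is applied inside the processing pipeline, not by this function).

import re

# Toy variant -> standard mapping; the real dictionary has tens of thousands of entries.
STANDARD = {"neuer": "never", "iewes": "jews", "peeres": "peers"}

def standardize(text):
    """Replace whole-word variant spellings with their standard forms."""
    return re.sub(r"[a-z']+", lambda m: STANDARD.get(m.group(0), m.group(0)), text.lower())

print(standardize("The peeres had neuer seene it"))   # -> "the peers had never seene it"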

To ensure spelling consistency across the dataset, I researched the most frequent original spellings in the TCP texts. The team in Wisconsin decided to aim for a 95% standardization guarantee. To meet that guarantee efficiently, I had to focus my research on the most frequent words. As a result, I examined the behavior of 26,700 original spellings that occurred 2,000 or more times in the TCP. Their frequencies accounted for 95.04% of the total number of words in the TCP.

I searched for spellings in the TCP using the command line. (Using UNIX tools is preferable to loading the 61,000 TCP texts into concordance software, as that is a heavy task for GUI-aided text processing.)

spellingStandardizationResearch

I examined thousands and thousands of instances of original spellings in brief context. It was an exercise in brevity, a trade-off between time and human labor. More often than not, the searches returned enough text surrounding the original spelling to understand its meaning. (For example, look at the screenshot of my research above: it is obvious when the original spelling "peeres" means "peers" and when it means "pears".) It would have been too time-consuming to open the TCP text files and read a paragraph of context for each original spelling; working that way, you could hardly make a judgment call on a single word in a day.
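
For readers more comfortable in Python than on the UNIX command line, a rough keyword-in-context equivalent of those searches looks like the sketch below; the folder name is a placeholder.

import re
from pathlib import Path

def kwic(spelling, folder, width=40):
    """Print each occurrence of `spelling` with a little surrounding context."""
    pattern = re.compile(rf".{{0,{width}}}\b{re.escape(spelling)}\b.{{0,{width}}}", re.IGNORECASE)
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in pattern.finditer(text):
            print(f"{path.name}: ...{match.group(0)}...")

kwic("peeres", "TCP_texts")   # folder name is hypothetical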

I made the following decisions about original spellings:

  • not to standardize original spellings because they were in what we already recognize as a standard form (e.g., “the”)
  • not to standardize original spellings because they would have introduced too much error into the dataset
  • to standardize original spellings to a certain spelling that was correct most of the time based on the behavior of the spelling across the entire TCP

Standardizing the most frequent original spellings resulted in major payoffs. To illustrate, compare the corpus frequency and rank in corpus for the word “Jews” in the two tables below.

1-Gram in the TCP (Original Spelling)

n-gram    corpus frequency    rank in corpus
jews      154027              849

1-Gram in the TCP (Standardized Spelling)

n-gram    corpus frequency    rank in corpus
jews      315702              458

Standardization located 161,675 more instances of the word "Jews" in the TCP, which has vast implications for those who study religion in early modern texts. Standardization yielded a roughly 105% gain in recognition. For the curious, here are the original spellings that are standardized to "Jews" in the VEP dictionary: ievves, ievvs, iewes, iews, jevves, jevvs, and jewes.

Data standardization is a form of editorial intervention that can have vast impacts for users. Granted, the provided standardization is a first step, and I invite others to expand upon my work. However, I argue that enforcing spelling consistency in the TCP corpora provides users with a cleaner, more accessible dataset. The standardized spelling makes the dataset easier to search. Users do not need to be experts in Early Modern English spelling conventions to extract meaningful information.

Curious to see what my editorial intervention looks like? I annotated Act 1 Scene 1 from Pericles. The annotations are provided in an interactive Text Viewer. (Image of the Text Viewer is directly below, and the link to access the Text Viewer is at the end of this entry.)

PericlesAnnotatoins

We are providing an annotated SimpleText example of Act 1 Scene 1 from Pericles. (A preview of the annotations is in the image above.) The text is annotated with three tags: Standardized, Researched, and Justified. Standardized highlights words that have had their spelling standardized; if a standardized spelling is incorrect in the Pericles scene, the annotation explains why within the context of the TCP corpus. Researched highlights words that were researched in the process of compiling the standardization dictionary but were not given standardized spellings, either because 1) they were already in a recognized form or 2) the original spelling had too many different meanings to be standardized. Justified highlights words that were not standardized and provides the reason why (e.g., the word's frequency was lower than 2,000).

View Annotations for Standardized Spelling in Pericles Act 1 Scene 1

A first attempt at LSA (Latent Semantic Analysis)

LSA is an older corpus processing method – it has somewhat gone out of favor for things like topic modeling – but I like it as an illustration because it is very simple. And to use it, we have to make all the critical decisions we would need to make with any statistical modeling approach. And of course, simple experiments are good for exposing errors in our data and process (fixes were made to my code in the process of doing this, but the data is standing up).

LSA is basically PCA performed on the word count vectors. So it has the problems of both: it only considers word counts, and it's only looking for variance.
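
In code, "PCA on the word count vectors" amounts to a truncated SVD of the document-term count matrix; here is a minimal scikit-learn sketch on a toy corpus, not the actual EEBO-TCP pipeline.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the tragedy of the king and of the kingdom",
    "a ballad of a lover and his lass",
    "an exact table of excise for beer and ale",
]

counts = CountVectorizer().fit_transform(docs)    # documents x words, sparse counts
svd = TruncatedSVD(n_components=2).fit(counts)    # the LSA "vectors" (no mean-centering)
print(svd.components_.shape)                      # (2, vocabulary size)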

First, I need to pick which documents. You might say “that’s easy – use all of EEBO-TCP,” but since I know there are things in the corpus that might throw me off, I might as well cull things first.

For now, I’ll first remove all documents for which our metadata tells us the document is not in English (since I am interested in English). And I’ll remove all documents shorter than 250 words (that number is arbitrary) – since if documents are too short normalization might get weird. (If a document is 5 words long, then each word is 20% of it.) Note these are arbitrary choices – they affect the results, and I should remember that I did them. But it would be appropriate to make different choices.

My data is now a 52049 x 6065951 matrix: 52049 documents in EEBO-TCP that are in English and long enough, and 6065951 words to be counted (although this is the vocabulary of the whole of the TCP, since I didn't remove words that only appear in the documents I excluded). I could remove words that don't appear a lot – this is a good idea if for no other reason than that it makes things go faster (dealing with 6 million columns takes time). It also helps because the small values for uncommon words not only won't matter in the answers (since the goal is to find the big trends), but can actually mess things up.

So, as a next step, I'll get rid of uncommon words. There are many ways to do this – for now, I'll only keep words that appear in 10 or more documents. That's still a lot of words (447215).
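
Document-frequency filtering like this is a couple of lines on a SciPy sparse count matrix; a sketch, assuming `counts` is the documents-by-words matrix described above.

import numpy as np
import scipy.sparse as sp

def keep_common_words(counts, min_docs=10):
    """Keep only the columns (words) that appear in at least `min_docs` documents."""
    counts = sp.csc_matrix(counts)
    doc_freq = (counts > 0).sum(axis=0).A1      # number of documents containing each word
    keep = np.where(doc_freq >= min_docs)[0]
    return counts[:, keep], keep                # filtered matrix plus the kept column indices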

As a sanity check, I can see if any documents are severely affected by this by finding the document with the lowest percentage of its words remaining after removing the words that occur in fewer than 10 documents. The worst document retains only 58% of its words: it's A12751, "A brief treatise for the measuring of glass, board, timber, or stone, square or round being performed only by simple addition and substraction, and that in whole numbers, with[o]ut any multiplication, or division at all / by John Speidell …", which has a huge table of numbers at the end.

If you're curious, on my (relatively fast, with lots of memory) computer, it takes 132 seconds to run LSA (for 20 vectors – I'll explain in a moment) on the whole massive matrix; on the reduced matrix (words in 10 or more documents), it takes 56 seconds (about half). In case you're wondering… this isn't too surprising: since the rare words were so rare, the computations were smart enough not to waste much time on them.

What can we learn from this first model?

LSA (or at least my implementation of it) is a little different from PCA in that I don't mean-center things before analyzing the matrix. This difference is small, since the mean comes out as the first component – but it can lead to other subtle differences.

The zeroth component is a vector that answers the question: "if you could pick only one vector to build the books from (each book being a scaled amount of this vector), which should you pick?" The right answer is to pick the average book, so we'd expect the zeroth vector to be proportional to the average over the books. The most common words therefore come out as the "strongest" in this vector. While the vector is big (it has a value for each of 447215 words), we can just look at the biggest entries:

[('the', 0.61865592352172205),
 ('of', 0.40348356075050867),
 ('and', 0.38341452460366676),
 ('to', 0.26211624561016428),
 ('in', 0.18697302248166897),
 ('that', 0.1653677553047907),
 ('a', 0.13252050497989859),
 ('is', 0.11119574845880302),
 ('it', 0.10087079005559051),
 ('he', 0.097740434591416875)]

What this tells us is that if we want to describe books (or their word vectors) with one number, this is the best vector to multiply that number by.
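
Pulling out those biggest entries is just an argsort over the component vector; a small sketch, where `component` and `vocab` stand in for one LSA vector and the matching word list, however you built them.

import numpy as np

def top_words(component, vocab, n=10):
    """The n words with the largest weights in one LSA component."""
    order = np.argsort(component)[::-1][:n]
    return list(zip(np.asarray(vocab)[order], component[order]))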

Usually, with PCA, we ask about the documents. But here, that won't tell us much: it will mainly tell us how long the books are. Since our book vectors are word counts, longer books will use more of the "average" stuff. The document with the biggest coefficient is simply the longest document.

But, if a document is weird (unlike the average) it might use more or less than you’d expect given how long it is. So we could normalize by the document length. Documents that score low on this are ones that don’t use the common words as much as you would expect (given their lengths). The document with the lowest score? A64613… ‘An useful table for all uictuallers & others dealing in beer & ale’ – which is a big table of numbers.

So before doing PCA/LSA we need to normalize for the length of the books. I will do this by dividing each row of the matrix by the length of the book in words (the L1 norm) – using the word count BEFORE we threw out uncommon words.
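
A sketch of that normalization on a sparse matrix, assuming `raw_counts` is the documents-by-words matrix before any words were dropped:

import scipy.sparse as sp

def l1_normalize_rows(raw_counts):
    """Divide each document row by that document's total word count (its L1 norm)."""
    raw_counts = sp.csr_matrix(raw_counts, dtype=float)
    lengths = raw_counts.sum(axis=1).A1          # total words per document
    return sp.diags(1.0 / lengths) @ raw_counts  # assumes no zero-length documents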

So now I am building my first "real" LSA model: I chose the EEBO-TCP documents that I know to be in English, only considered documents with more than 250 words, and am normalizing by the original word count of each document.

The zeroth vector is actually slightly different – suggesting that long documents use different proportions of words, although this isn’t necessarily something meaningful or statistically significant.

[('the', 0.59010211288364689),
 ('and', 0.42698349761656479),
 ('of', 0.39453632948464623),
 ('to', 0.28559405040930685),
 ('in', 0.18348404712929461),
 ('that', 0.15612291442776627),
 ('a', 0.13154084693700221),
 ('be', 0.10011301035536961),
 ('is', 0.097944973306089519),
 ('for', 0.096814220825602454)]

Again, this is just the top 10 words.

When I look at documents that are "strong" in this vector, they are documents that are made up of more common words than normal (with a big weighting on "the"). Here are the top few…

A28654   0.1983  A plain and easy rule to rigge any ship by the ...
A96930   0.1813  The use of the universal ring-dial.
B05531   0.1789  Proclamation anent the rendezvouses of the militia...
B04164   0.1780  Londons ordinary: or, Every man in his humor...
A92666   0.1768  A proclamation anent the rendezvouses of the militi...
B24252   0.1765  An explanation of Mr. Gunter's quadrant, as it ...
A72802   0.1764  Londons ordinary, or every man in his humor To ...
A66689   0.1764  To the Most Excellent Majesty of James the IId ...
A01661   0.1761  This book does create all of the best waters ...
B05530   0.1748  Proclamation anent the rendezvous of the militia, ...

That first one has a lot of “the” in it (as expected). I didn’t check the others, but there seems to be a theme.

If we look at the bottom scorers, we see books that don’t have many common words…

A64613   0.0034  An useful table for all uictuallers & others dealing in beer & ale
A28791   0.0039  A Book of the valuations of all the ecclesiastical preferments in England and Wa
A61096   0.0045  Villare Anglicum, or, A view of the towns of England collected by the appointmen
A88337   0.0062  A list of the imprisoned and secluded Members.
A28785   0.0067  A Book of the names of all parishes, market towns, villages, hamlets, and smalle
A95633   0.0068  A table of excise for strong beer and ale, for common brewers at 3s. 3d. the bar
B09789   0.0081  A list of the English captives taken by the pirates of Argier, made public for t
A95631   0.0082  A table of excise for small beer for common brewers at 9 d. the barrel, with the
A28887   0.0082  An exact alphabetical catalogue of all that have taken the degree of Doctor of P
A48671   0.0082  A list of the names and sums of all the new subscribers for enlarging the capita

Again, not surprising… these are books of lists (so they have no need for many articles).

Now if we look at the next principal vector (the 1st LSA vector – since the previous was the zeroth), we would expect to see the words with the most variance from the average. Since words that occur often will vary a lot in how often they occur (a small percentage of variance in a common word is still a lot of variation), we expect to see them again. And we do:

[('the', 0.38420645622523014),
 ('of', 0.32922762730011762),
 ('your', -0.084669152332760914),
 ('his', -0.088128326440172142),
 ('so', -0.089329190198924976),
 ('with', -0.092148634300003029),
 ('for', -0.1008359401690339),
 ('will', -0.11693098516660286),
 ('me', -0.11984594738903771),
 ('not', -0.14030401361190545),
 ('but', -0.14408743558833309),
 ('it', -0.15588034041694304),
 ('that', -0.16401056893070368),
 ('is', -0.17369018634007896),
 ('he', -0.1772685229064182),
 ('you', -0.18990955717829638),
 ('my', -0.19381100534134885),
 ('to', -0.22065219350754178),
 ('a', -0.25761137977106813),
 ('i', -0.37919606548968343)]

What's interesting here: "the" and "of" are at the top, and words that are more personal are at the bottom. If you're writing a lot about yourself or about people, you don't use as many definite articles. If you write about "the X of Y" a lot, you aren't going to say much about "I, you, …". So the top of this list is books that are declaring facts:

A28654   0.1152  A plain and easy rule to rigge any ship by the length of his masts, and yards, w
A66689   0.1107  To the Most Excellent Majesty of James the IId by the grace of God of England, S
A18462   0.1027  The Imperial acheiuement of our dread sovereign King Charles together with ye ar
B05531   0.0921  Proclamation anent the rendezvouses of the militia, for the year 1683.
A92666   0.0907  A proclamation anent the rendezvouses of the militia, for the year 1683; Proclam
B05530   0.0900  Proclamation anent the rendezvous of the militia, for the year, 1684.
A93594   0.0886  A survey of the microcosme. Or the anatomy of the bodies of man and woman wherei
A93593   0.0866  An exact survey of the microcosmus or little world being an anatomy, of the bodi
A39982   0.0844  The form of the proceeding to the coronation of Their Majesty's, King James the 
B03350   0.0830  The form of the proceeding to the coronation of their Majesty's, King James the

And the bottom?

B01741  -0.0804  Tobia's advice, or, A remedy for a ranting young man. While you are single you t
A75449  -0.0744  An Answer to unconstant William, or, The Young-man's resolution to pay the young
B00386  -0.0733  A new ditty: of a lover, tossed hither and th[i]ther, that cannot speak his mind
A15699  -0.0726  The honest vvooer his mind expressing in plain and few terms, by which to his mi
B01734  -0.0713  Doubtful Robin; or, Constant Nanny. A new ballad. Tune of, Would you be a man of
A06395  -0.0712  The lovers dream who sleeping, thought he did embrace his love, which when he wa
B00382  -0.0712  The lovers dream: who sleeping, thought he did embrace his love, which when he w
B02419  -0.0712  The country-mans care in choosing a wife: or, A young bachelor hard to be please
B04151  -0.0710  The London lasses lamentation: or, Her fear she should never be married. To the 
B02416  -0.0708  The country lovers conquest. In winning a coy lass ..., To a pleasant new tune,

Notice that this is all based on the most common words – they will drown out all the uncommon words.

Usually in modeling (like LSA), we get rid of the common words. Here, we can use them to learn things about the corpus and to help organize it.

Of course, with a PCA-like method, the first thing someone wants is a scatterplot. With 50,000+ points, it's a blotch. In time, we'll need some interactivity. You can identify the extreme points from the lists above.

[Scatterplot of all 50,000+ documents]

Let’s try to understand why PCA puts “the” and “of” together. Here’s a similar plot of “the” on the X axis and “of” on the Y.

[Scatterplot: "the" (x-axis) vs. "of" (y-axis)]

Not hugely correlated (it’s hard to see with the density). But compare that to a plot of “the” vs. “i”:

[Scatterplot: "the" (x-axis) vs. "i" (y-axis)]

This is pretty striking, because there are many documents that don't use "i" at all, and some that use "i" a lot. Yes, there is a document where 10% of the words are "i"!

A75449   0.1002  An Answer to unconstant William, ...
B03187   0.0977  An excellent new song, called, ...
B04151   0.0970  The London lasses lamentation: or, ...
A73881   0.0968  A Sweet and pleasant sonet, entitled, ...
B00386   0.0867  A new ditty: of a lover, tossed hither ...
B06543   0.0852  Where Helen lies. To an excellent new ...

Or a document that is nearly 20% “of”!

A75823   0.1917  An Account of what captives has been ...
A38857   0.1825  An exact account of the number of ...
A92491   0.1809  Act for raising four months supply. ...
A31306   0.1798  A Catalogue of the prelates and clergy ...
A38907   0.1734  An Exact catalogue of the names of ...
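These lists are just the corpus sorted by the relative frequency of a single word in each document. A minimal sketch, reusing the assumed counts dictionary from the snippet above:

# Rank documents by the fraction of their words that are a given word.
def top_docs_for_word(counts, word, n=10):
    freqs = []
    for doc_id, c in counts.items():
        total = sum(c.values())
        if total:
            freqs.append((c[word] / total, doc_id))
    return sorted(freqs, reverse=True)[:n]

top_docs_for_word(counts, "of")    # documents where nearly 20% of the words are "of"
top_docs_for_word(counts, "i")     # documents where around 10% of the words are "i"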

Can these kinds of analyses of common words tell us anything about large corpora?

Words / NotWords for Plays and All TCP

Let’s try the words/not words game again…

This time I will start with the entire TCP corpus (61,315 documents – this includes some Evans and some ECCO). There are 6,065,951 different words, which is a lot: it gives us a 61,315 × 6,065,951 matrix. But most words occur only once in the whole corpus, so I’ll limit things to words that appear in 5 or more documents (that still leaves 909,665 words). Note that these figures count distinct words (“the” counts once); the total word count is 1,541,992,473 (“the” alone occurs 89,821,184 times).
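A hedged sketch of how such a matrix can be built. The directory of plain-text files is hypothetical, and the tokenization and use of scikit-learn are my assumptions, not the actual VEP pipeline:

# Build a sparse document-term matrix, dropping words that appear in fewer than 5 documents.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

paths = sorted(Path("tcp_plain_text").glob("*.txt"))   # hypothetical plain-text dump of the TCP
texts = [p.read_text(encoding="utf-8") for p in paths]

vectorizer = CountVectorizer(lowercase=True, min_df=5)  # min_df=5: keep words seen in 5+ documents
X = vectorizer.fit_transform(texts)                     # sparse matrix: documents x words

print(X.shape)    # on the full TCP this would be roughly (61315, 909665)
print(X.sum())    # total word count after filtering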

For my subset, I’ll take the drama collection. While there are 1,244 plays in the Extended Drama 1700 corpus, there are only 1,020 TCP entries (since many plays appear in multi-play volumes). Note that I am using entire TCP files for this experiment, so if any non-play material is mixed in with the plays, I am counting it too.

The plays represent 1.56% of the TCP (in terms of word count, after throwing out words that appear in fewer than 5 documents).

Now we can play our words/notwords game… What words are used most in plays (relative to their total usage)?

Spoiler alert: I am not going to find too much that’s interesting…

It turns out that there are 32 words that are used only in plays (and remember, this means 5 or more plays, since we excluded words that occur in fewer than 5 documents).

[(1.0, 18, 37, 'foot-marshal'),
 (1.0, 9, 41, 'dsdeath'),
 (1.0, 9, 17, 'budg-batchelors'),
 (1.0, 9, 9, 'three-crane-wharf'),
 (1.0, 8, 30, 'dslife'),
 (1.0, 8, 18, 'yfayth'),
 (1.0, 8, 8, 'theater-royal'),
 (1.0, 7, 10, 'waiting-womans'),
 (1.0, 7, 8, 'shoulder-scarf'),
 (1.0, 7, 7, 'prithe^'),
 (1.0, 7, 7, 'exennt'),
 (1.0, 6, 12, 'ownds'),
 (1.0, 6, 11, 'wudst'),
 (1.0, 6, 11, "shan'not"),
 (1.0, 6, 9, 'pellited'),
 (1.0, 6, 7, 'whooh'),
 (1.0, 6, 6, 'skin-coate'),
 (1.0, 6, 6, 'bawdship'),
 (1.0, 5, 50, 'lucind'),
 (1.0, 5, 8, 'undisguises'),
 (1.0, 5, 7, "yow'le"),
 (1.0, 5, 7, "deserve'em"),
 (1.0, 5, 6, 'skirvy'),
 (1.0, 5, 6, 'fackings'),
 (1.0, 5, 5, 'y^aith'),
 (1.0, 5, 5, 'theatre-royall'),
 (1.0, 5, 5, "th'help"),
 (1.0, 5, 5, 'i*th'),
 (1.0, 5, 5, "hee'de"),
 (1.0, 5, 5, 'faintst'),
 (1.0, 5, 5, 'drawes'),
 (1.0, 5, 5, 'black-fryers-stairs')]

The most common words that don’t appear in plays? The numeric tokens are not a surprise – they look like chapter-and-verse references. “Abovementioned” is interesting (and makes sense).

[(0.0, 3884, 20419, 'hebr'),
 (0.0, 3809, 12209, 'hezekiah'),
 (0.0, 3621, 15286, 'ezra'),
 (0.0, 3427, 12124, '3.16'),
 (0.0, 3291, 11042, '1.5'),
 (0.0, 3250, 9704, '2.3'),
 (0.0, 3224, 10329, '2.2'),
 (0.0, 3220, 8945, 'abovementioned'),
 (0.0, 3179, 10275, '3.1'),
 (0.0, 3119, 10031, '3.5'),
 (0.0, 3102, 24106, 'esa'),
 (0.0, 3097, 9560, '2.1'),
 (0.0, 3038, 8862, '2.4'),
 (0.0, 3020, 8639, '2.13'),
 (0.0, 3008, 7163, 'long-suffering'),
 (0.0, 2980, 7576, 'micah'),
 (0.0, 2972, 8629, '3.2'),
 (0.0, 2966, 8331, '3.8'),
 (0.0, 2965, 8506, '3.15'),
 (0.0, 2963, 8884, '2.10')]

To get something more meaningful, let’s look at words that appear in many plays but in few non-plays. Here I restrict the list to words that occur 20 or more times:

[(0.98181818181818181, 28, 55, 'dyes'),
 (0.97959183673469385, 37, 98, 'sfoote'),
 (0.9666203059805285, 22, 719, 'iord'),
 (0.9322709163346613, 28, 251, 'eup'),
 (0.92708333333333337, 33, 96, 'vmh'),
 (0.92564102564102568, 33, 390, 'iph'),
 (0.92356687898089174, 33, 157, 'wonot'),
 (0.92307692307692313, 31, 78, 'shannot'),
 (0.89333333333333331, 35, 75, "s'foot"),
 (0.88741721854304634, 28, 302, 'borg')]

(A reminder: this means that “borg” appears 302 times in the TCP, across 28 documents, and 88.7% of those 302 occurrences are in plays.)
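A sketch of the computation behind these tuples, assuming the counts dictionary from before and a set play_ids holding the 1,020 drama TCP IDs (both names are mine): the fraction is simply occurrences-in-plays divided by occurrences-in-the-whole-TCP.

# For each word: (fraction of occurrences in plays, number of documents, total occurrences, word).
from collections import Counter

total_counts = Counter()   # occurrences of each word across the whole corpus
play_counts = Counter()    # occurrences of each word inside the play subset
doc_freq = Counter()       # number of documents each word appears in

for doc_id, c in counts.items():
    total_counts.update(c)
    doc_freq.update(c.keys())
    if doc_id in play_ids:
        play_counts.update(c)

rows = [(play_counts[w] / total_counts[w], doc_freq[w], total_counts[w], w)
        for w in total_counts if total_counts[w] >= 20]   # e.g. only words seen 20+ times
rows.sort(reverse=True)    # top: words used (almost) only in plays; bottom: words plays avoid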

At the other end of the list, we get words that appear a lot in non-plays, but not much in plays:

[(4.3112739814615219e-05, 5121, 23195, 'eccl'),
 (4.2984869325997249e-05, 3739, 23264, 'isai'),
 (4.0666937779585194e-05, 150, 24590, 'chwi'),
 (4.0412204485754696e-05, 6211, 24745, 'isaiah'),
 (3.389141191622043e-05, 165, 29506, 'wrth'),
 (2.5357541332792372e-05, 185, 39436, 'hyn'),
 (2.5131942699170645e-05, 137, 39790, 'efe'),
 (2.2714366837024417e-05, 6829, 44025, 'sanctification'),
 (2.0277805941397142e-05, 135, 49315, "a'r"),
 (1.1377793959529188e-05, 328, 175781, 'yr')]

As a check, the word “the” has 0.825% of its occurrences appearing in plays. This is about half of what we might expect (since the plays are 1.56% of the words). In contrast, “a” is at 1.9%, higher than expectation.

The Untranscribable in EEBO

As part of Visualising English Print, I have been evaluating and validating judgments about non-English print in the Text Creation Partnership transcriptions of EEBO. I’ve been looking at texts classified as non-English (or texts that appear to be non-English, such as lists of names or places) by an automated text tagger. Bi- or multi-lingual texts cause particular difficulties for this task: a large share of such a text can still be in English, yet it poses problems by containing a relatively high percentage of untaggable words. (Inconsistent orthography is another big difficulty, which is why VEP is working on improving the machine-readability of the TCP texts.)

In the case of Early English Books Online, transcribers were given very specific instructions on how – and what – to transcribe. The TCP provides a map of every character available in Unicode. This page is extremely thorough, covering a huge range of language and symbol character sets, including print symbols (❧, ☞, ⁂, ¶), alchemical symbols (♁, ♃, ℥, ☋), diacritics, and non-Latinate alphabetical symbols such as those associated with Greek, Hebrew, and Cyrillic. All of these characters therefore have the potential to be transcribed.

Characters which are considered part of the Classical Roman alphabet are retained, though there are a few exceptions, such as when the source image is obscured by heavy inking or damage to the page. The TCP guidelines also include an entire section devoted to foreign (that is, non-Roman) alphabets. The entire document is linked here, but I’ve replicated the important part below.

  1. “Foreign” (non-Roman) alphabets. Extended text in a non-roman alphabet. Though individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities (see discussion of Characters, below), entire words or extended passages in a non-Roman alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC=”foreign”>, without transcribing the word(s) themselves. The tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC=”foreign”></L></Q>; a paragraph in Greek as <P><GAP DESC=”foreign”></P>; and a stanza in Greek as <LG><GAP DESC=”foreign”></LG>.

    [Image: example of mixed Greek-English text] Record as: the semicircle .18.5, <GAP DESC=”foreign”> .21.7, <GAP DESC=”foreign”> .23

  2. The presence of musical notation should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “music”: <GAP DESC=”music”>.

    Extended spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) does not interrupt.

    Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC=”music”> tag.
    Any mathematical formulas or mathematical notation too complicated (or too dependent on two-dimensional layout) to be rendered as plain text should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “math.”

  3. Illegible text, missing and damaged text, or clear but unrecognized symbols all will require some attention from us. Illegible text that cannot be read, for whatever reason, should be marked using variations on the “$” symbol:
    $ = individual character or characters, less than a word.
    $word$ = a whole word
    $span$ = any span of two or more words, less than a page.
    $page$ = a whole page.
    Additional variants are possible if it proves useful to flag some other piece of the structure as unreadable, e.g.:
    $para$ = illegible paragraph
    $line$ = illegible line of verse or prose

    Unknown symbols or characters, if they can be distinguished from illegible characters, should preferably be recorded as “#”.

    The illegibility threshold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) “creative” capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.

This is primarily due to expedience: in order to capture lots of text quickly, transcribers were encouraged to give the most attention to Roman symbols, though glyphs like alchemical symbols are important for reading and understanding the texts in question and are therefore retained.[1] Print symbols such as pilcrows are also commonly included in the transcriptions, as again they are useful for navigating the text. However, as Chris Powell and Paul Schaffner very kindly confirmed for me,

The original instruction came to be modified in the course of actual practice. For one thing, majuscule Greek posed nothing like the same difficulties for capture and review as ligatured early modern Greek type did; in fact, it was difficult to think of a rule that would prevent the keyers from capturing it. So we let them do so, and attempted only to correct their tendency to capture Greek “A” as if it were Latin “A” etc., and so on for the other ambiguous glyphs. Lower-case, ligatured Greek we mostly left uncaptured, unless it served some structural purpose, e.g. was part of a title or chapter heading — or unless the TCP editor felt a rare and inexplicable impulse to type it in.

The end result of all this is that much, perhaps most, of the upper-case Greek has been captured, usually correctly, but that the vast majority of lower-case Greek has not been. (and a little bit of Hebrew, again when it served a structural purpose, and was essential for navigation through the book.) Math only when it could readily be represented as running text, or as a TEI/HTML table. Music only when it consisted of individual notes or symbols, no full tablature.

The implication here is that the final TCP texts are Unicode compliant, but only some of the characters from the full Unicode set make it into the TCP transcriptions. So Greek block letters (Γ, Σ, Φ, Ω) are sometimes transcribed, but the corresponding lowercase ligature symbols (γ, ς, φ, ω) are not.[2] Hebrew characters (א,ק,ש,מ) are very, very rarely transcribed, but Hebrew transliterated into Latinate characters will have been transcribed. Arabic is also notably absent, even though it is available in Unicode and is visible in several EEBO texts.
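One rough way to measure how much material has been dropped like this is to count the <GAP> placeholders in a TCP transcription. A minimal sketch: the file name is hypothetical, and the exact tag and attribute spelling varies between TCP releases, so the loose regular expression below is my assumption.

# Count untranscribed passages in a TCP XML file, grouped by the DESC value
# ("foreign", "music", "math", ...).
import re
from collections import Counter

xml = open("A90307.xml", encoding="utf-8").read()   # hypothetical local copy of a TCP file

# Match loosely; assumes straight double quotes in the stored XML.
gaps = re.findall(r'<gap[^>]*desc\s*=\s*"([^"]+)"', xml, flags=re.IGNORECASE)
print(Counter(gaps))   # e.g. Counter({'foreign': ..., 'music': ...}) -- counts depend on the file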

Using the JISC historical books (UK subscribing institutions) interface, a user is able to look at the TCP transcripts alongside the images from EEBO, making it possible to compare the print to the transcriptions. Here are some examples of what I mean when I say ‘untranscribed’.

This trilingual page from TCPID A90307 shows text in Latin, Greek script, and Hebrew:
[Image: A90307, trilingual page showing Greek and Hebrew replaced in the transcription]

As we can see, the Greek and Hebrew print in this text are rendered as […], even though there is printed material there. A second example, a few pages later in the same book, is even more illustrative of this issue: the Arabic page shown below is listed as unavailable, even though there is definitely print there, and the next page, in Latin, is transcribed.

[Image: A90307, Arabic page left untranscribed]

Even though they are available in the Unicode character set, these scripts are erased completely from the Text Creation Partnership’s texts.

I now want to mention some notable exceptions. As I’ve stressed, very little of the Hebrew is transcribed unless it’s been transliterated into Latinate characters, as in TCPID A37959, a three-column book which offers translations between Latin, Hebrew and Welsh:

[Image: A57259, transliterated Hebrew]

Syriac and related languages rendered in Latinate characters are also transcribed.

And as Paul and Chris suggest, block Greek characters can be transcribed, as this page from TCPID A57729 shows:
[Image: A57729, block Greek characters transcribed]
but, as the TCP guidelines recommend, the Greek ligatures further down the page are not transcribed and are instead replaced with […]:

[Image: A57729, ligatured Greek left untranscribed]

(The transcriptions do not retain the structure of the printed object!)

There’s also at least one fake alphabet in EEBO which is definitely not transcribed as there are no corresponding Unicode characters. But much of that book (TCPID A57259) is left untranscribed – though to its credit, […] is used for Arabic as opposed to pretending it’s not there at all. When present, Greek block characters in this book get transcribed too. It even contains this hyper-detailed transcription of a stylised alphabet, suggesting the definition of a Roman letter can be quite flexible:
[Image: A57259, hyper-detailed transcription of a stylised alphabet]

(Decorated initials are also recorded, in case you were curious.)

Of the languages discussed so far, Greek is by far the most common. Lots of evidence of historical printed Greek has been removed from the TCP corpus as a result of these rules. Although I haven’t yet seen a book printed entirely in Greek ligature, it wouldn’t be out of the question; plenty of texts I did look at contained lines, passages, paragraphs, or columns in ligature which are untranscribed and therefore eliminated from the corpus. Arabic and Hebrew are far less frequently found in EEBO, but it’s good to know they’re there. There may be other missing languages I don’t know about, simply because they never appear in the partially transcribed multilingual texts I’ve been examining.

For our purposes, this is not necessarily a bad thing. VEP privileges English-language printed material in our pipeline, as our resources are designed to provide high-accuracy ways of visualising, exploring and understanding English-language print from the TCP texts. At least 99% of Early English print is available in Roman characters, and even including Latin, we have a very high level of accuracy and coverage of the TCP data.

Individually, non-Roman glyphs may not represent massive amounts of early printed material, but in aggregate I’d estimate non-English print in EEBO will represent perhaps 1% of the entire EEBO-TCP corpus. This is still not a huge – or even meaningful – number, unless you study these languages, in which case it will be a big loss to you! But we will be releasing all the multi-lingual texts in the TCP collection soon, including some notes on the languages that do not make it into the transcriptions, meaning you will soon be able to conduct your own investigations on these multi-lingual documents.

+++++++

[1] A small number of alchemical symbols are also available as emoji these days, and confusingly can render as emoji in the transcriptions. They are: ♈ (Aries), ♉ (Taurus), ♊ (Gemini), ♍ (Virgo), ♎ (Libra), ♏ (Scorpio), ♐ (Sagittarius), ♑ (Capricorn), ♒ (Aquarius), ♓ (Pisces), ♌ (Leo). A full list of Unicode-compliant alchemical symbols is available from https://en.wikipedia.org/wiki/Alchemical_symbol#Unicode

[2] For more on lowercase ligature characters in Greek, see the description of Aldine-style characters by Jane Raisch for the JHI: https://jhiblog.org/2016/08/29/greek-to-me-the-hellenism-of-early-print. This essay is an interesting discussion of Greek in Early Modern print more generally, and worth a read if you are interested in the how/why/what of Greek language printing.