Guest Post: Data-Mining King Lear

[I am pleased to offer this guest post by Darby Foster, a first year undergraduate student at Georgia Institute of Technology, majoring in Business Administration/Information Technology Management. Her professor, Dr Sarah Higanbotham, was kind enough to get in touch with me to share Darby’s final paper, which appears in a truncated form here. VEP loves hearing from students whose imaginations have been really taken by the work we do. –hgf]

Darby Foster
Georgia Institute of Technology

First Folio, Emory University, Nov. 2016

When I read King Lear, I became even more curious about this play’s language. The corpus analysis software Ubiqu+ity allowed me to quantitatively analyze King Lear in terms of the play’s tragedy, trying to gain perspective on just how sad the play really is. My analysis provided substantial evidence against the claims of the literary critic George Steiner about Shakespeare and the genre of tragedy. As a Business Administration/IT Management major, I was not overly eager to take an English Literature course, and especially not a Shakespeare course focusing on the 1623 First Folio. And yet I have never been (and perhaps will never be again) so excited about research as I was when I applied data mining to Shakespeare’s late tragedy, King Lear. It began with Michael Witmore’s podcast on data-mining Shakespeare, which inspired me to experiment with data-mining: first with Hamlet, using the online corpus analysis software Voyant to isolate word trends in Hamlet’s soliloquies. In particular, I traced relative word frequencies in Hamlet’s speeches and found a predominance of comparisons (16 uses of the preposition “like”).

Most people define genre by overall narrative structure. To a traditional close reader, genre is “a type of literary work characterized by a particular form, style, or purpose” (“Genre”). But to a computer, “genre is a coordinated set of having things and not having things” (Witmore 2011). Data-mining software takes texts or selections of text and counts the occurrences of specific words and phrases. Certain words play a key role in tragic drama in particular, including doubt, sense, nature, and fortune (Booth 1983, 37). DocuScope’s dictionary categorizes thousands of words into “Positivity,” “Negativity,” “Anger,” “Sad,” and so on. By tracking the individual words in each category, I found it surprisingly easy to discover a play’s genre.
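To make the counting concrete, here is a minimal Python sketch of the kind of tallying such software performs. The category word lists are invented stand-ins, not DocuScope’s actual dictionary:

import re

# Toy stand-ins for two DocuScope-style categories; the real
# dictionary assigns thousands of words and phrases per category.
categories = {
    "Negativity": {"death", "curse", "torturous", "no", "never"},
    "Positivity": {"trust", "blessing", "hope", "love"},
}

def count_categories(text):
    """Count how many tokens in `text` fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {name: sum(token in words for token in tokens)
            for name, words in categories.items()}

print(count_categories("No cause, no cause. I have hope her death was a curse."))
# -> {'Negativity': 4, 'Positivity': 1}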

Hope and Witmore 2004, 2010

Shakespeare’s 1623 First Folio divides the plays according to genre: comedies, histories, and tragedies. While the compilers of this collection likely used plot to separate the plays into genres, the same separation can be done using data-mining (Witmore 2011). Unfortunately, at this level of analysis, the genre of tragedy can be difficult to distinguish. Data-mining software can easily differentiate between comedies and histories, but tragedies lie somewhere in between these two genres (Hope and Witmore 2004). DocuScope, a sophisticated data-mining tool, counts the occurrences of specific categories of words and phrases in sections of text and creates graphs to display the findings visually. The following graph is a scatterplot of 1,000-word pieces of all of Shakespeare’s plays, color-coded by genre (Witmore 2011): green dots represent histories, red dots comedies, orange dots tragedies, and blue dots the late plays. The graph shows that what histories have, comedies lack, and vice versa, while tragedies fall in the middle of these two more defined genres. The patterns in the graph suggest that in addition to sharing plot structures and characters, Shakespeare’s plays within the same genre were written with similar language and style.

One of Shakespeare’s most famous tragedies, King Lear, produces fascinating results when data-mined. DocuScope breaks up the text into over 100 categories of words. Each category contains thousands of words that were individually selected by David Kaufer, an English professor at Carnegie Mellon University, to fit a specific idea. One of the most prominent categories in the text of King Lear is “Negativity.” This category contains words such as death, curse, and torturous and corresponds to a total of 798 individual instances of negativity throughout the play (Ishizaki and Kaufer 2012). Such a strong presence of a single emotion shapes a work of literature; in this case, it also plays a large role in determining the play’s genre. Data-mining King Lear clearly reveals its tragic nature.

Anyone who experiences King Lear can likewise tell that the play is a tragedy. From act one, scene one, it is evident that things are going downhill, as the king reveals his “darker purpose” to divide his kingdom into three parts, one for each of his daughters, so they can rule while he takes an “unburdened crawl toward death” (Shakespeare 1997, 1.1.43). From this point forward, the play is filled with pessimism, tragic events, and nihilism. Some argue that the decision to divide the kingdom is the true climax of the story, breaking the mold of traditional Shakespearean tragedies (Bowers 1980, 13). This structure allows no time for introducing the classic narrative fall of Lear; it brings the audience right into the middle of the story, which quickly becomes tragic. The two most loving and loyal characters in the play, Cordelia and Kent, are quickly banished. Not long after, Lear himself is banished from the homes of his daughters and sent out into a terrible storm (Shakespeare 1997, 2.4.295-353). The play becomes less tolerable to the audience as Lear’s mental capacity deteriorates. Rather than the tragedy building slowly over five acts, the audience experiences King Lear’s fall from 1.1. As the play progresses, there is still hope that conflict will be resolved and the protagonist will live on, but Shakespeare refuses to fulfil the desires of his audience (Booth 1983, 17). Cordelia’s death shocks everyone. “Enter Lear, with Cordelia in his arms, and the most terrifying five minutes in literature have begun” (Booth 1983, 11). The play ends, not with poetic justice, but with a father carrying the body of the virtuous young daughter whom he misjudged. And to intensify the tragedy, Lear himself dies just minutes later.

A quantitative perspective on King Lear yields similar results. When the relative frequencies of specific types of language are graphed, patterns emerge in the data. An interesting example is “Positivity,” a category containing words and phrases such as trust, blessing, and hope. For example, “I pray you, sir, take patience: I have hope” (Shakespeare 1997, 2.4.130). While overall levels of negativity decrease as the play progresses, so do levels of positivity, which are almost always lower than the levels of negativity.

In the graph above, “Negativity” is represented in red and “Positivity” in blue, over time. The diminishing positivity can be attributed to the nature of tragedy: as more and more tragic events occur, the scenes and characters carry less positivity. This rising level of tragedy correlates with a steadily increasing level of overall sadness. While there are peaks and troughs in the graph of words categorized as “Sad,” the linear regression line shows an overall increase in sadness as the play goes on. This reflects the emotions of the characters as well as the mood imposed on the audience during the tragedy. Language categorized as “Anger” follows a similar pattern, increasing as the play progresses. In the overlay of the two graphs, with the DocuScope categories “Anger” in red and “Sad” in blue, note that the major peaks in both categories even roughly align. These two emotions, anger and sadness, are clearly correlated in this play. Both are typically thought of as negative emotions, which are common in tragedies: when tragic events occur, natural responses include sadness over what happened and anger that it happened. In King Lear, characters often experience one or both of these emotions in response to events in their lives.
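The upward trend described above comes from fitting a least-squares line to the per-segment frequencies. A sketch of that calculation, with invented numbers rather than the actual King Lear values:

import numpy as np

# Hypothetical per-segment percentages of "Sad" language across
# successive 1,000-word chunks of a play (invented values).
sad = np.array([1.2, 0.8, 1.5, 1.1, 1.9, 1.4, 2.2, 1.8, 2.5])
segments = np.arange(len(sad))

# Least-squares regression line: a positive slope means sadness
# rises as the play progresses, despite local peaks and troughs.
slope, intercept = np.polyfit(segments, sad, 1)
print(f"trend: {slope:+.3f} percentage points per segment")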

Lear is Shakespeare’s most tragic play. It is possibly even “the most devastating tragic apprehension in the whole of Western dramatic literature” (Jackson 1966, 26). As Stephen Booth summarizes, “watching Lear is not unlike waiting for the death of a dying friend; our eagerness for the end makes the friend no less dear” (Booth 1983, 17). This very specific feeling captures the experience of King Lear; it is so depressingly tragic that all the audience wants is for the misery of the play to end. This type of incredibly sad tragedy has its own name: absolute tragedy. Absolute tragedy “is immune to hope” (Steiner 2004, 4). It leaves no opportunity for the audience to believe that something good will come from all the negativity; it is unquestionably tragic. Such absolute tragedy “presents men and women who the gods torture and kill ‘for their sport’” (Steiner 2004, 11). This action is directly referenced in King Lear, when Gloucester recognizes late in the play, “As flies to wanton boys are we to th’ gods. They kill us for their sport” (Shakespeare 1997, 4.1.41-42). By this definition, King Lear aligns seamlessly with absolute tragedy.

Steiner disagrees. According to him, Shakespeare’s only absolute, and therefore most tragic, tragedy is Timon of Athens (Steiner 2004, 12). He argues that Timon’s utterly bleak plot and motifs make this play more tragic than the rest. A scan through DocuScope provides contrary results. In categories that are critical to the genre of tragedy, King Lear dominates. The chart on the right shows the percentage of each play that fits into the DocuScope categories of “Negativity,” “Positivity,” “Anger,” and “Sad.” These values show that King Lear is approximately 1.09 times more negative, 1.59 times sadder, and 1.02 times angrier than Timon of Athens, which also happens to be 1.08 times more positive than King Lear. Based on these metrics, King Lear clearly contains higher concentrations of words that are typically found in tragedies. This quantitative analysis provides a more precise technique for determining absolute tragedy, revealing that Lear is not only an absolute tragedy, but even more tragic than Timon of Athens.

Works Cited
Booth, Stephen. (1983). King Lear, Macbeth, Indefinition, and Tragedy. New Haven: Yale University Press.

Bowers, Fredson. (1980). “The Structure of King Lear.” Shakespeare Quarterly 31 (1): 7-20.

“Genre, N.” (2014). OED Online. Oxford University Press. Accessed February 7, 2017. http://www.oed.com/view/Entry/77629.

Hope, Jonathan and Michael Witmore. (2010). “The Hundredth Psalm to the Tune of ‘Green Sleeves’: Digital Approaches to Shakespeare’s Language of Genre.” Shakespeare Quarterly 61 (3): 357-90.

Hope, Jonathan, and Michael Witmore. (2004). “The Very Large Textual Object: A Prosthetic Reading of Shakespeare.” Early Modern Literary Studies 9 (12). Available online: purl.oclc.org/emls/09-3/hopewhit.htm.

Ishizaki, Suguru, and David Kaufer. (2012). DocuScope Dictionary. Accessed 7 November 2016. Available online: github.com/docuscope/DocuScope-Dictionary-June-26-2012.

Jackson, Ester Merle. (1966). “King Lear: The Grammar of Tragedy.” Shakespeare Quarterly 17 (1): 25-40.

Shakespeare, William. (1997). King Lear. Ed. R.A. Foakes. London: Arden Shakespeare. Available online: http://shakespeare.mit.edu/lear/.

Steiner, George. (2004). “’Tragedy,’ Reconsidered.” New Literary History 35 (1): 1-15.

Witmore, Michael. (2011). Data-Mining Shakespeare. Accessed 7 September 2016. Available online: https://youtu.be/W1RsgUqFEeY.

Using the Metadata Builder to guide an analysis

As we’ve been releasing new resources for interacting with the TCP files, one of the questions that keeps coming up is “This is great, but what are we supposed to do with this stuff?” In this blog post I’m going to show how you can use the Core 1660 Drama corpus (from our Early Modern Drama collection) and the Metadata Builder to look at plays which didn’t explicitly involve Shakespeare as an author. I also wanted a set of plays that covered a range of genre classifications (by any measure of genre) and was a manageable size.

Using the Metadata Builder, I have the option to collect a variety of metadata from the master spreadsheet associated with the Core Drama corpus. As I want to study the texts freely available as part of the TCP, I select the ‘Unrestricted’ option in Step 1 rather than ‘All’. In this particular case, I am interested in play companies, so I want to ensure I get metadata which will supplement and guide my analysis of them. I therefore select the following categories in Step 2: TCP, ESTC, Wiggins Number, Author 1, Authors 2-5, Title, Genre, Wiggins Genre, DEEP Genre, Harbage Genre, Wiggins Contemporary Genre, Date of Writing, Date of first performance, Play Company 1, Play Company 2, and Theatre.[1] I could have downloaded more metadata, but these categories seemed best suited to guide an analysis of one particular play company. Looking at the metadata spreadsheet and paying specific attention to the Play Company 1 category, I settled on the Admiral’s Men, as the group includes a diverse range of authors (including Munday, Dekker, Marlowe, Chapman and Peele) while remaining a manageable size (21 plays).

I then isolated the specific TCPIDs associated with each play-text belonging to the group I will now call ‘Admiral’s Men Plays’. Armed with this list, I copied these plays into a new folder to create a subcorpus of plays from the Core 1660 Drama Corpus. Here’s what that looked like:

The new subcorpus folder, containing the 21 Admiral’s Men play-text files
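If you prefer to script this step, here is a minimal sketch of building the subcorpus with Python’s standard library, assuming the corpus files are plain text named by TCPID (the IDs listed are hypothetical placeholders, not the actual Admiral’s Men list):

import shutil
from pathlib import Path

# TCPIDs pulled from the metadata spreadsheet (placeholders shown;
# the full Admiral's Men list runs to 21 plays).
admirals_men = ["A01234", "A05678", "A09876"]

source = Path("Core_Drama_1660")       # folder holding the full corpus
target = Path("Admirals_Men_Plays")    # the new subcorpus folder
target.mkdir(exist_ok=True)

for tcpid in admirals_men:
    shutil.copy(source / f"{tcpid}.txt", target / f"{tcpid}.txt")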

Having made decisions about what texts to analyse and moved the files around to create a corpus of Admiral’s Men Plays, I can now set up a multivariate linguistic analysis using Ubiqu+ity to observe some specific linguistic features. I’ve previously written about creating your own rules for Ubiqu+ity, but this time I want to use the standard DocuScope dictionary, a rich classification schema of the English language. While I may not necessarily agree with every decision behind the DocuScope categorization, it applies the same rules to every text it is given to analyze, which means that it counts the same features every time. Using the default settings on the Ubiqu+ity site, the system sends me a zipped folder of results. Included in this folder is a comma-separated values spreadsheet which reports what percentage of each file falls into each category.

A selection of linguistic categories reported by the DocuScope dictionary

Due to the way DocuScope categorises language, some linguistic groupings are more likely to be in use than others. For example, FirstPerson (I, me, etc.) is more frequent than Apology (sorry, apologies, etc.) because of how language is distributed: small boring words like I and me are far more frequent than more contentful words like ‘sorry’ or ‘apologies’. (This is part of a phenomenon called Zipf’s Law.) You may also have noticed that the filenames use the anonymized TCPID numbers; you can cross-reference for titles using the metadata spreadsheet.


With this report, it is also possible to conduct a variety of quantitative analyses. Our colleagues have projected all of these categories into multidimensional space using Principal Component Analysis, but it is sometimes easier to focus on just a few features. By limiting the spreadsheet to only a handful of Language Action Types, it becomes far more manageable to work with. If I were interested in the category ‘Sad’, I could rank the spreadsheet using Excel’s sort function and see that Two Lamentable Tragedies has the highest quantity of ‘sad’ language in the subcorpus I have constructed. I can then see what other features are highly ranked for Two Lamentable Tragedies, or I can use the SlimTV TextViewer included in the downloaded Ubiqu+ity folder to identify other high-frequency linguistic categories for this play. With the SlimTV viewer, I can see that ‘negativity’, ‘intensity’, and ‘standards-positive’ are all highly ranked:

Negativity, Standards(Positive) and Intensity are all highly ranked in Two Lamentable Tragedies
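The same ranking can also be scripted rather than clicked through in Excel. A minimal sketch using pandas, assuming the results spreadsheet has one row per text and one column per LAT (the file and column names here are illustrative, not Ubiqu+ity’s actual headers):

import pandas as pd

# Load the Ubiqu+ity results spreadsheet from the zipped folder.
results = pd.read_csv("ubiquity_results.csv")

# Sort the subcorpus by the percentage of "Sad" language, then inspect
# which other categories are high for the top-ranked play.
ranked = results.sort_values("Sad", ascending=False)
print(ranked[["text_name", "Sad", "Negativity", "Intensity"]].head())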

And that’s just a jumping-off point: what other plays share this linguistic profile? From here I could compare how these specific features are used in other plays performed by the Admiral’s Men. But that’s still too broad a research question, so here are some more specific ones to follow up on: Do other plays in this corpus also rank highly in those features? How do they compare to the majority of the plays in the Core Drama 1660 corpus? Does the prevalence of negativity and intensity correlate with the acting style or the material this group chose to perform?


[1] The metadata we have comes from several sources, including the Database of Early English Playbooks (DEEP) and Wiggins catalogues. We have also performed cross-checking between these two resources as well as including further reference to the ESTC and JISC Historic Texts, where necessary.

(this information has been taken in part from the VEP Core 1660 readme file [pdf])

TCP – lists the associated unique TCP ID number for the play in question

ESTC – Short Title Catalogue records, in case I need/want to find out more information about these texts

Wiggins Number – in case I want/need to reference the Wiggins catalogue for a particular play; based on Wiggins catalogues published to date

Author 1 – Primary author

Authors 2-5 – any other assigned authors, where applicable

Title – the title by which the play is commonly known – often the contemporary title.

Alternative Title – any other names the play could be known as. For example, the play known as “Volpone” is listed as “Volpone” for primary title, and “Volpone, or The Fox” is considered the secondary title. (We also use the ‘secondary title’ category to describe printed titles when they are different to performed titles.)

Genre – these are the genres originally assigned earlier in the project by Jonathan Hope – either Tragedy [TR], Tragicomedy [TC], Comedy [CO], History [HI], Masque [MA], Interlude [IN], Entertainment [EN], Dialogue [DI], or Non-Dramatic [ND]

Wiggins Genre – based on information from the Wiggins Catalogues (published to date)

DEEP genre – based on information from the Database of Early English Playbooks [link]

Harbage genre – from Harbage’s Annals of English Drama (1989)

Wiggins Contemporary Genre – genre classifications based on contemporary (modern) understandings of genre, taken from the Wiggins catalogues (based on those published thus far)

Date of writing – all texts have been given a date of writing. DEEP doesn’t have a date-of-writing column but sometimes offers a date range under ‘date of first production’; in these instances the earliest date was taken as the date of writing. Where Wiggins offers a fixed date of writing, that date was used.

Date of first performance – when the play is understood to be first performed, if known

Play company 1 – The company of first production according to DEEP

Play company 2 – DEEP’s company attribution (where applicable)

Theatre – theatre and/or location of production, where available


Using the Metadata Builder: Getting the information that you want

Yesterday, Deidre wrote about the release of our new Metadata Builder, which collates lots of available information about materials included in the Text Creation Partnership transcriptions in one place. For each corpus available, you have the option of downloading metadata only for texts freely available in the public domain, or metadata for texts both freely available and presently restricted, to be made available in the public domain in 2020 (we can distribute information about these restricted-access texts, but we can’t share the files). As a user of the Metadata Builder, I want to be able to take advantage of all the different metadata options available to supplement and guide my analyses. In this post I’ll walk you through a few ways of obtaining different kinds of information using the options on offer.

I happen to be interested in the language of dramatic writing. Visualizing English Print offers three different dramatic corpora: the Core Drama 1660 corpus, the Expanded Drama 1660 corpus, and the Expanded Drama 1700 Corpus. (Many scholars of Early Modern drama will be familiar with the Database of Early English Playbooks or DEEP as it is commonly known; this is in some ways quite similar). Of our dramatic corpora, the Expanded Drama 1700 corpus (ED1700) covers the largest quantity of dramatic writing, so I’ll use it as an example.

If I want all the metadata available for this corpus regardless of public-domain status, I would select ‘All’ available texts in Step 1. However, if I want to use this metadata to guide decisions about a project, I might prefer the ‘Unrestricted’ version of the corpus, as these texts are all freely available for download from our site.

First things first: to get all of our available metadata for either version of ED1700 specified in Step 1, select ‘all’ under every drop-down menu in Step 2. This is the “all you can eat” option: it will include every piece of metadata we have available, and from there you can download the spreadsheet and its associated readme file in Steps 4 and 5. With everything downloaded, you can always further refine your spreadsheet, but I find it useful to keep one master spreadsheet pristine and do metadata manipulations, such as organising by author, date or other parameters, in a second version of the original spreadsheet.

While it is great to have everything, sometimes that can be overwhelming. This post is therefore not meant to be a how-to guide so much as a ‘ways of thinking about the Metadata Builder’ guide. Here are a few of the metadata columns which I personally find most useful.

If you want to get the dedicated TCP ID number associated with each transcription, you’ll want to select the category ‘TCP’ from the dropdown menu “Master Metadata”. These unique TCP identifiers match to a specific transcription: so the TCP identification number A01234 will always link to that specific document. ESTC data (including Wing numbers, where applicable) is available under the option ‘ESTC’.

Under Master Metadata, we also offer information from the Wiggins Catalogues of British Drama, including their identification number schema and historical and contemporary generic assignments. Other useful generic information includes the DEEP genre and the Harbage genre, should you want to compare different understandings of genre over time or use various criteria to show variation in generic forms. I also often want to know how many words are in each text, as this is a common way of describing how big or long a text is. This can be found under the Ubiq categories; select ‘# word tokens’ at the very bottom of the Ubiq dropdown menu in Step 2.

The ability to group plays by company, based on information from the Wiggins Catalogues (using the options for Play Company 1 and Play Company 2 under Master Metadata in Step 2), means that I can easily organise an analysis using attributed information about the working theatrical networks of the time and ask what makes the language of plays put on by the King’s Men different from, say, all other companies. With the option of including up to five authors as well, I can start to build more complex analyses along multiple axes, such as asking only about single-authored plays performed by Queen Henrietta Maria’s Men that are over 20,000 words long, as sketched below.
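A sketch of that kind of multi-axis selection using pandas; the file and column names are illustrative and should be checked against the headers in your downloaded spreadsheet:

import pandas as pd

# Metadata spreadsheet downloaded from the Metadata Builder
# (filename and column names are assumptions for this example).
meta = pd.read_csv("ED1700_metadata.csv")

subset = meta[
    (meta["Play Company 1"] == "Queen Henrietta Maria's Men")
    & (meta["Author 2"].isna())           # single-authored plays only
    & (meta["# word tokens"] > 20000)     # longer than 20,000 words
]
print(subset["Title"].tolist())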

What I am doing here is not limiting my corpus based on arbitrary features, but selecting texts which fit certain parameters to get at more specific questions. The more features I pull in, the more information I can base my decisions on, but not all of the categories in Step 2 may be immediately useful. For example, I probably don’t need to know whether there are figures (images) in these texts or how many pages long the texts originally were: that’s probably not going to help me. By excluding them from my spreadsheet, I am able to focus on more relevant information (and if I decide I do want them later, I can always get them from the all-metadata spreadsheet I downloaded in the first instance).

Another thing I can do with the Metadata Builder is download DocuScope tagging statistics for every text in a specified corpus using the dropdown menu ‘Ubiq’ in Step 2. This means that I do not have to process the ED1700 Unrestricted corpus through Ubiqu+ity myself, but can instead combine multiple metadata categories alongside the statistical distributions produced by the DocuScope text-tagging schema. By selecting relevant metadata categories such as author(s), date of first performance, theatrical group, and several views of genre assignment, I am setting myself up for quite a nuanced multivariate analysis using these particular features.

Finally, the multivariate analyses I suggested above do not necessarily require further computational methods. The ability to isolate texts based on a certain principle can guide any number of decisions for studies which rely on close reading, such as identifying transcriptions which have multilingual content, or realising there is a text you didn’t know about that has a clear connection to your previous work. The Metadata Builder therefore makes it possible to obtain a great deal of information about the huge number of texts now available as a result of the TCP project. We look forward to what you will do with it!

Making your own rules for use with Ubiqu+Ity

Several years ago, Michael Witmore and Jonathan Hope published a paper in Shakespeare Quarterly that describes how the string-matching rhetorical analysis software DocuScope is able to identify stylistic fingerprints of genre in Shakespeare’s plays. Visualizing English Print is proud to make the string-matching rules used by DocuScope available online for general use as part of the multivariate textual analysis package Ubiqu+Ity.

Ubiq landing page

The DocuScope dictionaries, which were initially designed to analyze rhetorical features such as persuasiveness or first-person reporting, cover 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects (see http://www.cmu.edu/dietrich/english/research/docuscope.html for more information). Figure 4, taken from Ishizaki and Kaufer (2011), illustrates their process:

Building the DS Dictionaries

According to David Kaufer, the creator of the DocuScope dictionaries, words or phrases which share an ‘aboutness’ can be grouped together in a hierarchical model of what he describes as Language Action Types (LATs); when someone runs his DocuScope dictionary on any given corpus, the software will search for exact matches based on the classifications he has made and report statistical frequencies for each category. While the DocuScope dictionaries are quite specific – in many ways they represent the creators’ view of how language functions – any corpus sent through the dictionary will be analysed in the same way. It doesn’t matter if you send all of Charles Dickens’ novels or emails from your mother or all of Shakespeare’s plays through the DocuScope classification schema; the dictionary will check for the exact same features every time. (The joy of DocuScope, and of any string-matching software like it, is that every text uses these terms in a slightly different distributional pattern.)

In other words, Ubiqu+Ity matches text to entries in the dictionaries, then computes the percentage of words per document that falls into each LAT category. Essentially, Ubiqu+Ity parses text and then tells you what categories the language falls under, according to the rules outlined in the DocuScope dictionaries. With Ubiqu+Ity, we offer several versions of Kaufer’s dictionaries as well as the ability to create your own rules. What if, for example, you were interested in the language of gender? While the DocuScope dictionaries cover a huge range of rhetorical and linguistic features, they do not have a category explicitly devoted to gender, though terminology related to gender can appear in a variety of existing LATs.
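In spirit, that computation looks something like the following toy sketch. The two LATs and their word lists are invented stand-ins, and real DocuScope rules also match multi-word strings, which this single-word version ignores:

import re

# Two toy LATs; Kaufer's dictionaries hold millions of string
# patterns, including multi-word phrases.
rules = {
    "FirstPerson": {"i", "me", "my", "mine"},
    "Apology": {"sorry", "apologies", "pardon"},
}

def lat_percentages(text):
    """Percentage of word tokens in `text` matched by each LAT."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {lat: 100 * sum(t in words for t in tokens) / len(tokens)
            for lat, words in rules.items()}

print(lat_percentages("I am sorry; pardon me, I pray you."))
# -> {'FirstPerson': 37.5, 'Apology': 25.0}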

How to specify own rules

As these instructions suggest, we would need to create our new dictionary as a Comma-Separated Values (CSV) sheet in Excel. To the uninitiated, a Comma-Separated Values file is a spreadsheet, but of a specific format. Where Excel files end with the suffix “.xlsx” (akin to “.docx”, the Word equivalent), CSV files end with the suffix “.csv”. It looks like any other spreadsheet in Excel, but it is a non-proprietary format, which means your data will move comfortably across any software program and retain its structure. When you save a file as a CSV, the commas are inserted as column delimiters for you. If you include any special characters (spaces, punctuation, etc.) in your rules, Ubiqu+Ity will search for that exact match. Consider two ways of formatting the same set of rules.

The good formatting keeps each rule to two plain columns – a word, then its category – so the computer will recognize the saved file as

he, masculine
his, masculine
him, masculine
man, masculine
boy, masculine
she, feminine
her, feminine
hers, feminine
woman, feminine
girl, feminine

And the less good formatting puts a comma inside each word’s cell, so to the computer the file will read

he,, masculine
his,, masculine
him,, masculine
man,, masculine
boy,, masculine
she,, feminine
her,, feminine
hers,, feminine
woman,, feminine
girl,, feminine

(The second version may not necessarily be bad formatting outright, depending on what you’re interested in counting, but it definitely is less good than the first if you just want to count words and not words with punctuation!)
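If you would rather not build the file by hand, a small script can write a correctly delimited CSV and sidestep the stray-comma problem entirely. A sketch using Python’s standard csv module (the output filename is just an example):

import csv

rules = [
    ("he", "masculine"), ("his", "masculine"), ("him", "masculine"),
    ("man", "masculine"), ("boy", "masculine"),
    ("she", "feminine"), ("her", "feminine"), ("hers", "feminine"),
    ("woman", "feminine"), ("girl", "feminine"),
]

# newline="" prevents blank rows on Windows; the writer supplies the
# comma delimiters itself, so no cell needs to contain punctuation.
with open("gender_rules.csv", "w", newline="") as f:
    csv.writer(f).writerows(rules)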

These lists can be as long or as short as you want, and they can be as specific or as vague as you want: whatever you tell Ubiqu+Ity to find, it will find. Once you upload your own dictionary, Ubiqu+Ity will use it to analyse your corpus, which is where the real fun starts. Here’s an example I ran using the VEP plain-text version of the Folger Digital Texts Shakespeare corpus. You can download a CSV file reporting the statistics of your user-defined rules, which will look like this:

User-defined CSV rules

This spreadsheet reports what percentage of each text matches each user-defined rule, just as it would with the DocuScope dictionaries. I’ve used the rules described above; the more categories you define, the larger your spreadsheet will be, of course. From here, you can do the usual Excel things, like graphing the results to see how ‘masculine’ and ‘feminine’ words are distributed across the plays:

Percentages of ‘masculine’ and ‘feminine’ words for each play in the corpus

Looking at this chart, I immediately want to know why Two Gentlemen of Verona has such a comparatively high volume of ‘feminine’ terms compared to other Shakespeare plays. But computers are also very good at identifying absence in ways that we humans cannot, so I am also interested in seeing why some plays, like 1 Henry 6, Love’s Labour’s Lost, and A Midsummer Night’s Dream, have a smaller proportion of ‘masculine’ language overall – now I have specific research questions to tackle based on my initial findings.
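For those who prefer scripting to Excel, a chart like the one above could be reproduced from the user-defined-rules CSV along these lines (a sketch with illustrative file and column names):

import pandas as pd
import matplotlib.pyplot as plt

# Results CSV from the user-defined rules (names are assumptions).
results = pd.read_csv("user_defined_rules.csv")

# Grouped bar chart of the two user-defined categories per play.
results.plot(x="text_name", y=["masculine", "feminine"], kind="bar")
plt.ylabel("% of words in play")
plt.tight_layout()
plt.savefig("gender_language_by_play.png")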

The Untranscribable in EEBO

As part of Visualising English Print, I have been evaluating and validating judgments about non-English print in the Text Creation Partnership transcriptions of EEBO. I’ve been looking at texts which have been classified as non-English (or texts that appear to be non-English, such as lists of names or places) by an automated text tagger. Bi- or multi-lingual texts cause particular difficulties for this task: a strong percentage of such a text can be in English yet still pose problems by containing a relatively high percentage of untaggable words. (Inconsistent orthography is another big difficulty, which is why VEP is working on improving the machine-readability of the TCP texts.)

In the case of Early English Books Online, transcribers were given very specific instructions on how – and what – to transcribe. The TCP provides a map of every character available in Unicode. This page is extremely thorough, covering a huge range of language and symbol character sets including print symbols (❧, ☞, ⁂, ¶), alchemical symbols (♁, ♃, ℥, ☋), diacritics, and non-Latinate alphabetical symbols, including those associated with Greek, Hebrew, and Cyrillic. All of these characters therefore have the potential to be transcribed.

Characters which are considered part of the Classical Roman alphabet are retained, though there are a few exceptions, such as when the source image is obscured by heavy inking or damage to the page. The TCP guidelines also include an entire section devoted to foreign (that is, non-Roman) alphabets; I’ve replicated the important part below.

  1. “Foreign” (non-Roman) alphabets. Extended text in a non-roman alphabet. Though individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities (see discussion of Characters, below), entire words or extended passages in a non-Roman alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC="foreign">, without transcribing the word(s) themselves. The tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC="foreign"></L></Q>; a paragraph in Greek as <P><GAP DESC="foreign"></P>; and a stanza in Greek as <LG><GAP DESC="foreign"></LG>.

    example of mixed Greek-English text

    Record as: the semicircle .18.5, <GAP DESC="foreign"> .21.7, <GAP DESC="foreign"> .23

  2. The presence of musical notation should be recorded with the <GAP> tag, with the value of the "DESC" attribute assigned as "music": <GAP DESC="music">.

    Extended spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) does not interrupt.

    Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC="music"> tag.
    Any mathematical formulas or mathematical notation too complicated (or too dependent on two-dimensional layout) to be rendered as plain text should be recorded with the <GAP> tag, with the value of the "DESC" attribute assigned as "math."

  3. Illegible text, missing and damaged text, or clear but unrecognized symbols will all require some attention from us. Illegible text that cannot be read, for whatever reason, should be marked using variations on the "$" symbol:
    $ = individual character or characters, less than a word.
    $word$ = a whole word
    $span$ = any span of two or more words, less than a page.
    $page$ = a whole page.
    Additional variants are possible if it proves useful to flag some other piece of the structure as unreadable, e.g.:
    $para$ = illegible paragraph
    $line$ = illegible line of verse or prose

    Unknown symbols or characters, if they can be distinguished from illegible characters, should preferably be recorded as "#".

    The illegibility threshold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) "creative" capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.

This is primarily due to expedience: in order to capture lots of text quickly, transcribers were encouraged to give the most attention to Roman symbols, though glyphs like alchemical symbols are important for being able to read and understand the texts in question and are therefore retained.[1] Print symbols such as pilcrows are also commonly included in the transcriptions, as again they are useful for navigating the text. However, as Chris Powell and Paul Schnaffer very kindly confirmed for me,

The original instruction came to be modified in the course of actual practice. For one thing, majuscule Greek posed nothing like the same difficulties for capture and review as ligatured early modern Greek type did; in fact, it was difficult to think of a rule that would prevent the keyers from capturing it. So we let them do so, and attempted only to correct their tendency to capture Greek “A” as if it were Latin “A” etc., and so on for the other ambiguous glyphs. Lower-case, ligatured Greek we mostly left uncaptured, unless it served some structural purpose, e.g. was part of a title or chapter heading — or unless the TCP editor felt a rare and inexplicable impulse to type it in.

The end result of all this is that much, perhaps most, of the upper-case Greek has been captured, usually correctly, but that the vast majority of lower-case Greek has not been. (and a little bit of Hebrew, again when it served a structural purpose, and was essential for navigation through the book.) Math only when it could readily be represented as running text, or as a TEI/HTML table. Music only when it consisted of individual notes or symbols, no full tablature.

The implication here is that the final TCP texts are Unicode compliant, but only some of the characters from the full Unicode set make it into the TCP transcriptions. So Greek block letters (Γ, Σ, Φ, Ω) are sometimes transcribed, but the corresponding lowercase ligature symbols (γ, ς, φ, ω) are not.[2] Hebrew characters (א,ק,ש,מ) are very, very rarely transcribed, but Hebrew transliterated into Latinate characters will have been transcribed. Arabic is also notably absent, even though it is available in Unicode and is visible in several EEBO texts.
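Because untranscribed material is marked with <GAP> tags, its extent can itself be counted. A sketch using only Python’s standard library; the folder name is hypothetical, and the regex accepts both the upper-case form quoted in the guidelines and lower-case TEI variants:

import re
from collections import Counter
from pathlib import Path

# Tally the DESC values of <GAP> tags across a folder of TCP files.
gap_pattern = re.compile(r'<gap[^>]*desc="([^"]+)"', re.IGNORECASE)

counts = Counter()
for path in Path("tcp_texts").glob("*.xml"):
    counts.update(gap_pattern.findall(path.read_text(encoding="utf-8")))

print(counts)  # e.g. Counter({'foreign': 412, 'music': 31, 'math': 5})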

Using the JISC historical books (UK subscribing institutions) interface, a user is able to look at the TCP transcripts alongside the images from EEBO, making it possible to compare the print to the transcriptions. Here are some examples of what I mean when I say ‘untranscribed’.

This trilingual page from TCPID A90307 shows text in Latin, Greek script, and Hebrew:
A90307 trilingual transcribe issue

As we can see, the Greek and Hebrew print in this text are rendered as […], even though there is printed material there. A second example, a few pages later in the same book, is even more illustrative of this issue: the Arabic page shown below is listed as unavailable, even though there is definitely print there, while the next page, in Latin, is transcribed.

A90307 arabic problem

Even though they are available in the Unicode character set, these scripts are erased completely from the Text Creation Partnership’s texts.

I now want to mention some notable exceptions. As I’ve stressed, very little of the Hebrew is transcribed unless it’s been transliterated into Latinate characters, such as in TCPID A37959, a three-column book which offers translations between Latin, Hebrew and Welsh:

A57259 transliterated hebrew

Syriac and related languages rendered in Latinate characters are also transcribed.

And as Paul and Chris suggest, block Greek characters can be transcribed, as this page from TCPID A57729 shows:
A57729 greek block transcribed
but, as the TCP guidelines recommend, the Greek ligatures later down the page are not transcribed and are instead replaced with […]:
A57729 greek script untranscribed

(The transcriptions do not retain the structure of the printed object!)

There’s also at least one fake alphabet in EEBO which is definitely not transcribed, as there are no corresponding Unicode characters. But much of that book (TCPID A57259) is left untranscribed – though to its credit, […] is used for Arabic as opposed to pretending it’s not there at all. When present, Greek block characters in this book get transcribed too. It even contains a hyper-detailed transcription of a stylised alphabet, suggesting the definition of a Roman letter can be quite flexible.

(Decorated initials are also recorded, in case you were curious.)

Of the languages discussed so far, Greek is by far the most common. Lots of evidence of historical printed Greek has been removed from the TCP corpus as a result of these rules. Although I haven’t seen a book printed entirely in Greek ligature yet, it wouldn’t be out of the question; plenty of texts I did look at contained lines, passages, paragraphs, or columns in ligature which are untranscribed and therefore eliminated from the corpus. Arabic and Hebrew are far less frequently found in EEBO, but it’s good to know they’re there. There may be other missing languages I don’t know about because they aren’t available in multilingual texts which are partially transcribed.

For our purposes, this is not necessarily a bad thing. VEP privileges English-language printed material in our pipeline, as our resources are designed to provide high-accuracy ways of visualising, exploring and understanding English-language print from the TCP texts. At least 99% of Early English print is in Roman characters, and even including Latin, we have a very high level of accuracy and coverage of the TCP data.

Individually, non-Roman glyphs may not represent massive amounts of early printed material, but in aggregate I’d estimate non-English print represents maybe 1% of the entire EEBO-TCP corpus. This is still not a huge – or even meaningful – number, unless you study these languages, in which case it will be a big loss to you! But we will be releasing all the multilingual texts in the TCP collection soon, including some notes on languages that do not make it into the transcriptions, meaning you will soon be able to conduct your own investigations on these multilingual documents.

+++++++

[1] A small number of alchemical symbols are also available as emoji these days, and confusingly can render as emoji in the transcriptions. They are: ♈ (Aries), ♉ (Taurus), ♊ (Gemini), ♍ (Virgo), ♎ (Libra), ♏ (Scorpio), ♐ (Sagittarius), ♑ (Capricorn), ♒ (Aquarius), ♓ (Pisces), ♌ (Leo). A full list of Unicode-compliant alchemical symbols is available from https://en.wikipedia.org/wiki/Alchemical_symbol#Unicode

[2] For more on lowercase ligature characters in Greek, see the description of Aldine-style characters by Jane Raisch for the JHI: https://jhiblog.org/2016/08/29/greek-to-me-the-hellenism-of-early-print. This essay is an interesting discussion of Greek in Early Modern print more generally, and worth a read if you are interested in the how/why/what of Greek language printing.