Using the Metadata Builder: Getting the information that you want

Yesterday, Deidre wrote about the release of our new Metadata Builder, which collates lots of available information about materials included in the Text Creation Partnership transcriptions in one place. For each corpus available, you have the option of downloading metadata only for texts freely available in the public domain or metadata for texts both freely available and presently restricted, to be made available in the public domain in 2020 (we can distribute information about these restricted-access texts, but we can’t share the files). As a user of the Metadata Builder, I want to be able to take advantage of all the different metadata options available to supplement and guide my analyses. In this post I’ll walk you through a few ways of obtaining a couple different kinds of information using the various kinds information on offer. 

I happen to be interested in the language of dramatic writing. Visualizing English Print offers three different dramatic corpora: the Core Drama 1660 corpus, the Expanded Drama 1660 corpus, and the Expanded Drama 1700 Corpus. (Many scholars of Early Modern drama will be familiar with the Database of Early English Playbooks or DEEP as it is commonly known; this is in some ways quite similar). Of our dramatic corpora, the Expanded Drama 1700 corpus (ED1700) covers the largest quantity of dramatic writing, so I’ll use it as an example.

If I want all the metadata available for this corpus regardless of public-domain status, I would selected ‘All’ available texts in Step 1. However, if I want to use this metadata to guide decisions about a project I might prefer to use the ‘Unrestricted’ version of the corpus, as these texts are all freely available for download from our site.

First things first: to get all of our available metadata for either version of ED1700 specified in Step 1, select ‘all’ under every drop-down menu in Step 2. This is the “all you can eat” option: it will include every piece of metadata we have available, and from there you can download the spreadsheet and its associated readme file in Step 4 and 5. metadatabuilder-allWith everything, you can always further refine your downloaded spreadsheet, but I find it to be useful to keep one master spreadsheet pristine and do metadata manipulations, such as organising by author, date or other parameters on in a second version of the original spreadsheet.

While it is great to have everything, sometimes that can be too overwhelming. This post is therefore not meant to be a how-to guide but more of a ‘ways of thinking about the Metadata Builder’ guide. Here are a few of the metadata columns we offer which I personally find most useful. 

If you want to get the dedicated TCP ID number associated with each transcription, you’ll want to select the category ‘TCP’ from the the dropdown menu “Master Metadata”. These unique TCP identifiers match to a specific transcription: so the TCP identification number A01234 will always link to this specific document. TCP-noESTC data (including Wing numbers, where applicable) is available under the option ‘ESTC’.  

Under Master Metadata, we also offer information from the Wiggins Catalogues of British Drama, including their identification number schema and historical and contemporary generic assignments. Other useful generic information includes the DEEP genre and the Harbage genre, should you want to compare different understandings of genre forms over time or using various criteria to show variation in generic forms. I also I often want to know how many words are in each text, as this is a common way of describing how big or long a text is. This can be found under the Ubiq categories; select ‘# word tokens’ at the the very bottom of the Ubiq dropdown menu in Step 2.

The ability to group plays by company, based on information from the Wiggins Catalogues (using the options for Play Company 1 and Play Company 2 under Master Metadata in Step 2) means that I can easily organise an analysis using attributed information about working theatrical networks of the time and ask what makes the language of plays put on by the King’s Men different than, say, all other companies. With the options of including up to five authors as well, I can start to make these more complex analyses using multiple axis, such as asking only about single-authored plays performed by Queen Henrietta Maria’s Men that are over 20,000 words long.

What I am doing here is not limiting my corpus based on arbitrary features, but by selecting texts which fit certain parameters to get at more specific questions. The more features I pull in, the more information I can base my decisions around, but not all of these categories in Step 2 may be immediately useful. For example, I probably don’t need to know if there are figures (images) in these texts or how many pages long the texts originally were: that’s probably not going to help me. By excluding them from my spreadsheet, I am able to focus more on more relevant information (and if I decide I do want to know about it later, I can always get it from the all-metadata-spreadsheet I downloaded in the first instance).

Another thing I can do with the Metadata Builder is download Docuscope tagging statistics for every text in a specified corpus using the dropdown menu ‘Ubiq’ in Step 2. This means that I do not have to process the ED1700 Unrestricted corpus through Ubiqu+ity myself, but rather combine multiple metadata categories alongside the statistical distributions produced by the Docuscope text-tagging schema.  By selecting relevant metadata categories such as author(s) date of first performance, theatrical group, and several views of genre assignment, I am setting myself up for quite a nuanced multivariate analysis using these particular features.

Finally, the multivariate analyses I suggested above do not necessarily require the use of further computational methods. The ability to isolate all the texts based on a certain principle can guide any number of decisions for studies which rely on close-reading, such as identifying transcriptions which are have multi-lingual content and realising there is a text which you didn’t know about but has a clear connection to your previous work. The Metadata Builder therefore makes the ability to obtain a lot of information about a huge number of texts now available as a result of the TCP project. We look forward to what you will do with it!

The Untranscribable in EEBO

As part of Visualising English Print, I have been evaluating and validating judgments about non-English print in the Text Creation Partnership transcriptions of EEBO. I’ve been looking at texts which have been classified as non-English (or texts that appear to be non-English, such as lists of names or places) by an automated text tagger. Bi- or multi-lingual text cause particular difficulties for this task, as a strong percentage of the text can still be in English but still pose problems by containing a relatively high percentage of untaggable words. (Inconsistent orthography is another big difficulty for this task, which is why VEP is working on improving the machine-readability of the TCP texts).

In the case of Early English Books Online, transcribers were given very specific instructions on how – and what – to transcribe. The TCP provides a map of every character in available in Unicode. This page is extremely thorough, covering a huge range of language and symbol character sets including print symbols (❧, ☞ ,⁂, ¶), alchemical symbols (♁, ♃, ℥, ☋), diacritics, and non-Latinate alphabetical symbols, including those associated with Greek, Hebrew, and Cyrillic. All of these characters therefore have the potential to be transcribed.

Characters which are considered part of the Classical Roman alphabet are retained, though there are few exceptions, such as when the source image is obfuscated by heavy inking or damage to the page. The TCP guidelines also include an entire section devoted to foreign (that is, non-Roman) alphabets. The entire document is linked here, but I’ve replicated the important part below.

  1. “Foreign” (non-Roman) alphabets. Extended text in a non-roman alphabet. Though individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities (see discussion of Characters, below), entire words or extended passages in a non-Roman alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC=”foreign”>, without transcribing the word(s) themselves. The tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC=”foreign”></L></Q>; a paragrah in Greek as <P><GAP DESC=”foreign”></P>; and a stanza in Greek as <LG><GAP DESC=”foreign”></LG>.

    example of mixed Greek-English textRecord as: the semicircle .18.5, <GAP DESC=”foreign”> .21.7, <GAP DESC=”foreign”> .23

  2. The presence of musical notation should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “music”: <GAP DESC=”music”>.

    Extened spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) do not interrupt.

    Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC=”music”> tag.
    Any mathematical formulas or mathematical notation too complicated (or too dependent on two-dimensional layout) to be rendered as plain text should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “math.”

  3. Illegible text, missing and damaged text, or clear but unrecognized symbols all will require some attention from us.Illegible text that cannot be read, for whatever reason, should be marked using variations on the “$” symbol:
    $ = individual character or characters, less than a word.
    $word$ = a whole word
    $span$ = any span of two or more words, less than a page.
    $page$ = a whole page.
    Additional variants are possible if it proves useful to flag some other piece of the structure as unreadable, e.g.:
    $para$ = illegible paragraph
    $line$ = illegible line of verse or prose

    Unknown symbols or characters if they can be distinguished from illegible characters, should preferably be recorded as “#”.

    The illegibility threshhold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) “creative” capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.

This is primarily due to expedience: in order to capture lots of text quickly, transcribers were encouraged to give the most attention to Roman symbols, though glyphs like alchemical symbols are important for being able to read and understand the texts in question and are therefore retained.[1] Print symbols such as pilcrows are also commonly included in the transcriptions, as again they are useful for navigating the text. However, as Chris Powell and Paul Schnaffer very kindly confirmed for me,

The original instruction came to be modified in the course of actual practice. For one thing, majuscule Greek posed nothing like the same difficulties for capture and review as ligatured early modern Greek type did; in fact, it was difficult to think of a rule that would prevent the keyers from capturing it. So we let them do so, and attempted only to correct their tendency to capture Greek “A” as if it were Latin “A” etc., and so on for the other ambiguous glyphs. Lower-case, ligatured Greek we mostly left uncaptured, unless it served some structural purpose, e.g. was part of a title or chapter heading — or unless the TCP editor felt a rare and inexplicable impulse to type it in.

The end result of all this is that much, perhaps most, of the upper-case Greek has been captured, usually correctly, but that the vast majority of lower-case Greek has not been. (and a little bit of Hebrew, again when it served a structural purpose, and was essential for navigation through the book.) Math only when it could readily be represented as running text, or as a TEI/HTML table. Music only when it consisted of individual notes or symbols, no full tablature.

The implication here is that the final TCP texts are Unicode compliant, but only some of the characters from the full Unicode set make it into the TCP transcriptions. So Greek block letters (Γ, Σ, Φ, Ω) are sometimes transcribed, but the corresponding lowercase ligature symbols (γ, ς, φ, ω) are not.[2] Hebrew characters (א,ק,ש,מ) are very, very rarely transcribed, but Hebrew transliterated into Latinate characters will have been transcribed. Arabic is also notably absent, even though it is available in Unicode and is visible in several EEBO texts.

Using the JISC historical books (UK subscribing institutions) interface, a user is able to look at the TCP transcripts alongside the images from EEBO, making it possible to compare the print to the transcriptions. Here are some examples of what I mean when I say ‘untranscribed’.

This trilingual page from TCPID A90307 shows text in Latin, Greek script, and Hebrew:
A90307 trilingual transcribe issueAs we can see, the Greek and Hebrew print in this text are rendered as […], even though there is printed material there. A second example, a few pages later in the same book, is even more illustrative of this issue: the Arabic page shown below is listed as unavailable, even though there is definitely print there, and the next page, in Latin, is transcribed.

A90307 arabic problem

Even though they are available in the Unicode character set, these scripts are erased completely from the Text Creation Partnerships’ texts.

I now want to mention some exceptions which are notable. As I’ve stressed, very little of the Hebrew is transcribed unless it’s been transliterated into Latinate characters, such as in TCPID A37959, a three column book which offers translations between Latin, Hebrew and Welsh:

A57259 transliterated hebrew

Syriac and related languages rendered in Latinate characters are also transcribed.

And as Paul and Chris suggest, block Greek characters can be transcribed, as this page from TCPID A57729 shows:
A57729 greek block transcribed
but, as the TCP guidelines recommend, the Greek ligatures later down the page are not transcribed and are instead replaced with […]:
A57729 greek script untranscribed(The transcriptions do not retain the structure of the printed object!)

There’s also at least one fake alphabet in EEBO which is definitely not transcribed as there are no corresponding Unicode characters. But much of that book (TCPID A57259) is left untranscribed – though to its credit, […] is used for Arabic as opposed to pretending it’s not there at all. When present, Greek block characters in this book get transcribed too. It even contains this hyper-detailed transcription of a stylised alphabet, suggesting the definition of a Roman letter can be quite flexible:
Screen Shot 2016-09-23 at 5.42.04 pm

(Decorated initials are also recorded, in case you were curious.)

Of the languages discussed so far, Greek is by far the most common. Lots of evidence of historical printed Greek has been removed from the TCP corpus as a result of these rules. Although I haven’t seen a book printed entirely in Greek ligature yet, it wouldn’t be out of the question; plenty of texts I did look at contained lines, passages, paragraphs, or columns in ligature which are untranscribed and therefore eliminated from the corpus. Arabic and Hebrew are far less frequently found in EEBO, but it’s good to know they’re there. There may be other missing languages I don’t know about because they aren’t available in multilingual texts which are partially transcribed.

For our purposes, this is not necessarily a bad thing. VEP privileges English-language printed material in our pipeline, as our resources are designed to get high-accuracy ways of visualising, exploring and understanding English-language print from the TCP texts. At least 99% of Early English print is available in Roman characters, and even including Latin, we have a very high level of accuracy and coverage of the TCP data.

Individually, non-Roman glyphs may not represent massive amounts of early printed material, but in aggregate I’d estimate non-English print in EEBO is going to be represent maybe 1% of the entire EEBO-TCP corpus. This is still not a huge – or even meaningful – number, unless you study these languages, in which case it will be a big loss to you! But we will be releasing all the multi-lingual texts in the TCP collection soon, including some notes on languages that do not make it into the transcriptions, meaning you will soon be able to conduct your own investigations on these multi-lingual documents.


[1] A small number of alchemical symbols are also available as emoji these days, and confusingly can render as emoji in the transcriptions. They are: ♈ (Aries), ♉ (Taurus), ♊ (Gemini), ♍ (Virgo), ♎ (Libra), ♏ (Scorpio), ♐ (Sagittarius), ♑ (Capricorn), ♒ (Aquarius), ♓ (Pisces), ♌ (Leo). A full list of Unicode-compliant alchemical symbols is available from

[2] For more on lowercase ligature characters in Greek, see the description of Aldine-style characters by Jane Raisch for the JHI: This essay is an interesting discussion of Greek in Early Modern print more generally, and worth a read if you are interested in the how/why/what of Greek language printing.