Words / NotWords for Plays and All TCP

Let’s try the words/not words game again…

This time I will start with the entire TCP corpus (61,315 documents; this includes some Evans and some ECCO). There are 6,065,951 different words, which is a lot: it gives us a 61,315 × 6,065,951 document-by-word matrix. But most words occur only once in the whole corpus, so I’ll limit things to words that appear in 5 or more documents (that still leaves 909,665 words). Note that these are counts of different words (e.g. “the” counts once). The total “word count” is 1,541,992,473 (e.g. “the” occurs 89,821,184 times).
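A minimal sketch of the document-frequency filter described above, assuming each document is already tokenized into a list of words. The function names and the toy corpus are hypothetical, and the threshold is lowered from 5 to 2 only so that the effect shows on three tiny documents. It also illustrates the difference between counting different words and counting total occurrences.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a document counts each word once
    return df

def filter_vocabulary(docs, min_docs=5):
    """Keep only the words that appear in at least `min_docs` documents."""
    df = document_frequency(docs)
    return {word for word, n in df.items() if n >= min_docs}

# Hypothetical toy corpus, not real TCP data.
toy_docs = [
    "the king and the fool".split(),
    "the queen and the king".split(),
    "the fool exennt".split(),
]

vocab = filter_vocabulary(toy_docs, min_docs=2)
n_types = len(vocab)  # different words kept ("the" counts once)
n_tokens = sum(1 for doc in toy_docs for w in doc if w in vocab)  # total occurrences kept

print(sorted(vocab), n_types, n_tokens)
```

On the real corpus the same idea would normally be applied to a sparse document-by-word matrix rather than Python lists, since a dense 61,315 × 6,065,951 matrix would not fit in memory.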

For my subset, I’ll take the drama collection. While there are 1244 plays in the Extended Drama 1700 corpus, there are only 1020 TCP entries (since many plays are in multi-play volumes). Note that I am using entire TCP files for this experiment, so if a file has some non-play material mixed in with the plays, I am counting that too.

The plays represent 1.56% of the TCP (in terms of word count, after throwing out words that appear in fewer than 5 documents).

Now we can play our words/notwords game: which words are used most in plays (relative to their total usage)?
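As a rough sketch of the computation (not necessarily how the numbers in this post were produced), the score for each word is its count inside plays divided by its count in the whole corpus. The function name, the `is_play` flags, and the toy documents are all hypothetical. The output rows use a (fraction in plays, number of documents, total count, word) layout.

```python
from collections import Counter

def play_fraction_table(docs, is_play):
    """Return (fraction_in_plays, n_docs, total_count, word) for every word."""
    total, in_plays, n_docs = Counter(), Counter(), Counter()
    for tokens, play in zip(docs, is_play):
        counts = Counter(tokens)
        total.update(counts)          # occurrences in the whole corpus
        n_docs.update(counts.keys())  # one per document containing the word
        if play:
            in_plays.update(counts)   # occurrences inside plays only
    return [(in_plays[w] / total[w], n_docs[w], total[w], w) for w in total]

# Hypothetical toy corpus: the first document is a play, the others are not.
toy_docs = [
    "exennt the king exennt".split(),
    "the lord hebr".split(),
    "the king hebr hebr".split(),
]
toy_is_play = [True, False, False]

# Descending tuple order: highest fraction first, ties broken by document count.
table = sorted(play_fraction_table(toy_docs, toy_is_play), reverse=True)
for row in table:
    print(row)
```

A fraction of 1.0 means the word occurs only in plays, and 0.0 means it never occurs in a play.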

Spoiler alert: I am not going to find too much that’s interesting…

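It turns out that there are 32 words that are used only in plays (and remember, this means 5 or more plays, since we excluded words that occur in fewer than 5 documents). Each entry below is (fraction of occurrences in plays, number of documents, total occurrences, word).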

[(1.0, 18, 37, 'foot-marshal'),
 (1.0, 9, 41, 'dsdeath'),
 (1.0, 9, 17, 'budg-batchelors'),
 (1.0, 9, 9, 'three-crane-wharf'),
 (1.0, 8, 30, 'dslife'),
 (1.0, 8, 18, 'yfayth'),
 (1.0, 8, 8, 'theater-royal'),
 (1.0, 7, 10, 'waiting-womans'),
 (1.0, 7, 8, 'shoulder-scarf'),
 (1.0, 7, 7, 'prithe^'),
 (1.0, 7, 7, 'exennt'),
 (1.0, 6, 12, 'ownds'),
 (1.0, 6, 11, 'wudst'),
 (1.0, 6, 11, "shan'not"),
 (1.0, 6, 9, 'pellited'),
 (1.0, 6, 7, 'whooh'),
 (1.0, 6, 6, 'skin-coate'),
 (1.0, 6, 6, 'bawdship'),
 (1.0, 5, 50, 'lucind'),
 (1.0, 5, 8, 'undisguises'),
 (1.0, 5, 7, "yow'le"),
 (1.0, 5, 7, "deserve'em"),
 (1.0, 5, 6, 'skirvy'),
 (1.0, 5, 6, 'fackings'),
 (1.0, 5, 5, 'y^aith'),
 (1.0, 5, 5, 'theatre-royall'),
 (1.0, 5, 5, "th'help"),
 (1.0, 5, 5, 'i*th'),
 (1.0, 5, 5, "hee'de"),
 (1.0, 5, 5, 'faintst'),
 (1.0, 5, 5, 'drawes'),
 (1.0, 5, 5, 'black-fryers-stairs')]

The most common words that don’t appear in plays? The numbers are not a surprise. “Abovementioned” is interesting (and makes sense).

[(0.0, 3884, 20419, 'hebr'),
 (0.0, 3809, 12209, 'hezekiah'),
 (0.0, 3621, 15286, 'ezra'),
 (0.0, 3427, 12124, '3.16'),
 (0.0, 3291, 11042, '1.5'),
 (0.0, 3250, 9704, '2.3'),
 (0.0, 3224, 10329, '2.2'),
 (0.0, 3220, 8945, 'abovementioned'),
 (0.0, 3179, 10275, '3.1'),
 (0.0, 3119, 10031, '3.5'),
 (0.0, 3102, 24106, 'esa'),
 (0.0, 3097, 9560, '2.1'),
 (0.0, 3038, 8862, '2.4'),
 (0.0, 3020, 8639, '2.13'),
 (0.0, 3008, 7163, 'long-suffering'),
 (0.0, 2980, 7576, 'micah'),
 (0.0, 2972, 8629, '3.2'),
 (0.0, 2966, 8331, '3.8'),
 (0.0, 2965, 8506, '3.15'),
 (0.0, 2963, 8884, '2.10')]

To get something more meaningful, let’s limit ourselves to words that appear in many plays, but not many non-plays. Here I limit us to words that appear in 20 or more documents:

[(0.98181818181818181, 28, 55, 'dyes'),
 (0.97959183673469385, 37, 98, 'sfoote'),
 (0.9666203059805285, 22, 719, 'iord'),
 (0.9322709163346613, 28, 251, 'eup'),
 (0.92708333333333337, 33, 96, 'vmh'),
 (0.92564102564102568, 33, 390, 'iph'),
 (0.92356687898089174, 33, 157, 'wonot'),
 (0.92307692307692313, 31, 78, 'shannot'),
 (0.89333333333333331, 35, 75, "s'foot"),
 (0.88741721854304634, 28, 302, 'borg')]

(A reminder, this means that “borg” appears 302 times in TCP across 28 documents. 88.7% of those 302 times are in plays.)
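Here is a small sketch of that filtering step, working directly on rows in the (fraction, documents, count, word) layout. It assumes the threshold of 20 applies to the document count (the second field), which is what the rows suggest: “foot-marshal” occurs 37 times, all in plays, but in only 18 documents, and it is absent from the filtered list. The function name and parameters are hypothetical, and the sample rows are copied from the lists in this post.

```python
def top_play_words(rows, min_docs=20, k=10):
    """Drop words in fewer than `min_docs` documents, then take the top k."""
    kept = [row for row in rows if row[1] >= min_docs]
    return sorted(kept, reverse=True)[:k]

# A handful of rows taken from the lists in this post.
rows = [
    (1.0, 18, 37, 'foot-marshal'),
    (1.0, 5, 5, 'drawes'),
    (0.98181818181818181, 28, 55, 'dyes'),
    (0.88741721854304634, 28, 302, 'borg'),
    (0.0, 3220, 8945, 'abovementioned'),
]

top = top_play_words(rows, min_docs=20, k=2)
for frac, n_docs, count, word in top:
    # Recover the raw in-play count from the fraction.
    in_plays = round(frac * count)
    print(f"{word}: {in_plays} of {count} occurrences are in plays ({n_docs} documents)")
```

Multiplying the fraction back out gives 268 of the 302 occurrences of “borg” in plays.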

At the other end of the list, we get words that appear a lot in non-plays, but not much in plays:

[(4.3112739814615219e-05, 5121, 23195, 'eccl'),
 (4.2984869325997249e-05, 3739, 23264, 'isai'),
 (4.0666937779585194e-05, 150, 24590, 'chwi'),
 (4.0412204485754696e-05, 6211, 24745, 'isaiah'),
 (3.389141191622043e-05, 165, 29506, 'wrth'),
 (2.5357541332792372e-05, 185, 39436, 'hyn'),
 (2.5131942699170645e-05, 137, 39790, 'efe'),
 (2.2714366837024417e-05, 6829, 44025, 'sanctification'),
 (2.0277805941397142e-05, 135, 49315, "a'r"),
 (1.1377793959529188e-05, 328, 175781, 'yr')]

As a check, the word “the” has 0.825% of its occurrences in plays. This is about half of what we might expect (since the plays are 1.56% of the words). In contrast, “a” is at 1.9%, which is higher than expected.
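That comparison is a single division: a word’s share of occurrences in plays over the plays’ share of the whole corpus. The sketch below uses the rounded percentages quoted in this post; the function name is hypothetical.

```python
def relative_use(share_in_plays, baseline):
    """Ratio of a word's play share to the plays' share of the corpus.

    1.0 means the word is used exactly as often as expected; below 1.0
    it is under-represented in plays, above 1.0 it is over-represented.
    """
    return share_in_plays / baseline

baseline = 0.0156  # plays are 1.56% of the word count

the_ratio = relative_use(0.00825, baseline)  # "the": 0.825% of its uses are in plays
a_ratio = relative_use(0.019, baseline)      # "a": 1.9% of its uses are in plays

print(f"the: {the_ratio:.2f}x expected, a: {a_ratio:.2f}x expected")
```

So “the” shows up in plays at roughly 0.53 times the expected rate, and “a” at roughly 1.22 times.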

Author: Mike Gleicher

Mike Gleicher is Professor of Computer Sciences, and is founder of the UW Graphics Group. He's interested in most things involving the creation and use of pictures: visualization, animation, stylized rendering, photography and videography, etc. Lately, Mike has been thinking about visual comparisons, rethinking photography, animating social behaviors, and improvisational presentations. He came to Wisconsin in 1998 after spending time at Apple, Autodesk, Carnegie Mellon, and Duke.
