LSA is an older corpus processing method – it’s kind of gone out of favor for things like topic modeling – but I like it as an illustration because it is very simple. And to use it, we have to make all the critical decisions we need to do with any statistical modeling thing. And of course, simple experiments are good for exposing errors in our data and process (fixes were made to my code in the process of doing this, but the data is standing up).
LSA is basically performing PCA on the word count vectors. So, it has all the problems of both (it only considers word counts, and its only looking for variance).
First, I need to pick which documents. You might say “that’s easy – use all of EEBO-TCP,” but since I know there are things in the corpus that might throw me off, I might as well cull things first.
For now, I’ll first remove all documents for which our metadata tells us the document is not in English (since I am interested in English). And I’ll remove all documents shorter than 250 words (that number is arbitrary) – since if documents are too short normalization might get weird. (If a document is 5 words long, then each word is 20% of it.) Note these are arbitrary choices – they affect the results, and I should remember that I did them. But it would be appropriate to make different choices.
My data is now a 52049 x 6065951 matrix of 52049 documents in EEBO-TCP that are in English and long enough, 6065951 words to be counted – although this is the number in the whole of TCP since I didn’t remove the words that only appear in this corpus). I could remove words that don’t appear a lot – this is a good idea if for no other reason that it makes things go faster (dealing with 6 million columns takes time). It also helps because the small values for uncommon words not only won’t matter in the answers (since the goal is to find the big trends), and that these small values can actually mess things up.
So, I’ll as a next step get rid of uncommon words. There are many ways to do this – for now, I’ll only allow words that appear in 10 or more documents. That’s still a lot of words (447215).
As a sanity check, I can see if any documents are severely affected by this by seeing what document has the lowest percentage of words that remain after removing those words that occur in 10 or fewer documents. The worst document only retains 58% of its words (it’s A12751, “‘A brief treatise for the measuring of glass, board, timber, or stone, square or round being performed only by simple addition and substraction, and that in whole numbers, with[o]ut any multiplication, or division at all / by John Speidell …” which has a huge table of numbers at the end.
If you’re curious, on my (relatively fast with lots of memory) computer, it takes 132 seconds to run LSA (for 20 vectors – I’ll explain in a moment) on the whole massive matrix, on the reduced (10 or more documents), it takes 56 seconds (about half). In case you’re wondering… this isn’t too surprising. Since those rare words were so rare, the computations were smart enough not to waste much time on them.
What can we learn from this first model?
LSA (or at least my implementation) of it is a little different than PCA in that I don’t mean center things before analyzing the matrix. This difference is small, since the mean comes out as the first component – but it can lead to other subtle differences.
The zeroth component is a vector that says “if you could pick only one vector to make the books” (each book is a scaled amount of this vector) which should you pick? The right answer is to pick the average book. So we’d expect that the zeroth vector to be proportional to the average over the books. So the most common words come out as the “strongest” in this vector. While the vector is big (it has a value for each of 447215 words), we can just look at the biggest entries:
[('the', 0.61865592352172205), ('of', 0.40348356075050867), ('and', 0.38341452460366676), ('to', 0.26211624561016428), ('in', 0.18697302248166897), ('that', 0.1653677553047907), ('a', 0.13252050497989859), ('is', 0.11119574845880302), ('it', 0.10087079005559051), ('he', 0.097740434591416875)]
What this tells us is that if you we want to describe books (or their word vectors) with 1 number, this is the best vector to multiply that number by.
Usually, with PCA we ask about the documents. But here, it won’t tell us much: it will mainly tell us how long the books are. Since our book vectors are word counts, longer books will use more of the “average” stuff. The number with the biggest amount is the longest document.
But, if a document is weird (unlike the average) it might use more or less than you’d expect given how long it is. So we could normalize by the document length. Documents that score low on this are ones that don’t use the common words as much as you would expect (given their lengths). The document with the lowest score? A64613… ‘An useful table for all uictuallers & others dealing in beer & ale’ – which is a big table of numbers.
So before doing PCA/LSA we need to normalize for the length of the books. I will do this by dividing each row of the matrix by the length of the book in words (the L1 norm) – using the word count BEFORE we threw out uncommon words.
So now I am building my first “real” LSA model: I chose EEBO-TCP documents that I know to be in English, only considered documents with more than 250 words, and am normalizing the the original word count of the document.
The zeroth vector is actually slightly different – suggesting that long documents use different proportions of words, although this isn’t necessarily something meaningful or statistically significant.
[('the', 0.59010211288364689), ('and', 0.42698349761656479), ('of', 0.39453632948464623), ('to', 0.28559405040930685), ('in', 0.18348404712929461), ('that', 0.15612291442776627), ('a', 0.13154084693700221), ('be', 0.10011301035536961), ('is', 0.097944973306089519), ('for', 0.096814220825602454)]
Again, this is just the top 10 words.
When I look at documents that are “strong” in this vector, its documents that are made up more common words than normal (with a big weighting on “the”). Here are the top few…
A28654 0.1983 A plain and easy rule to rigge any ship by the ... A96930 0.1813 The use of the universal ring-dial. B05531 0.1789 Proclamation anent the rendezvouses of the militia... B04164 0.1780 Londons ordinary: or, Every man in his humor... A92666 0.1768 A proclamation anent the rendezvouses of the militi... B24252 0.1765 An explanation of Mr. Gunter's quadrant, as it ... A72802 0.1764 Londons ordinary, or every man in his humor To ... A66689 0.1764 To the Most Excellent Majesty of James the IId ... A01661 0.1761 This book does create all of the best waters ... B05530 0.1748 Proclamation anent the rendezvous of the militia, ...
That first one has a lot of “the” in it (as expected). I didn’t check the others, but there seems to be a theme.
If we look at the bottom scorers, we see books that don’t have many common words…
A64613 0.0034 An useful table for all uictuallers & others dealing in beer & ale A28791 0.0039 A Book of the valuations of all the ecclesiastical preferments in England and Wa A61096 0.0045 Villare Anglicum, or, A view of the towns of England collected by the appointmen A88337 0.0062 A list of the imprisoned and secluded Members. A28785 0.0067 A Book of the names of all parishes, market towns, villages, hamlets, and smalle A95633 0.0068 A table of excise for strong beer and ale, for common brewers at 3s. 3d. the bar B09789 0.0081 A list of the English captives taken by the pirates of Argier, made public for t A95631 0.0082 A table of excise for small beer for common brewers at 9 d. the barrel, with the A28887 0.0082 An exact alphabetical catalogue of all that have taken the degree of Doctor of P A48671 0.0082 A list of the names and sums of all the new subscribers for enlarging the capita
Again, not surprising… these are books of lists (so they have no need for many articles).
Now if we look at the next principle vector (the 1st LSA vector – since the previous was the zeroth) we would expect to see the words where we have the most variance from the average. Since words that occur often will vary a lot in how often they occur (a small percentage of variance in a common word is still a lot of variation), we expect to see them again. And we do:
[('the', 0.38420645622523014), ('of', 0.32922762730011762), ('your', -0.084669152332760914), ('his', -0.088128326440172142), ('so', -0.089329190198924976), ('with', -0.092148634300003029), ('for', -0.1008359401690339), ('will', -0.11693098516660286), ('me', -0.11984594738903771), ('not', -0.14030401361190545), ('but', -0.14408743558833309), ('it', -0.15588034041694304), ('that', -0.16401056893070368), ('is', -0.17369018634007896), ('he', -0.1772685229064182), ('you', -0.18990955717829638), ('my', -0.19381100534134885), ('to', -0.22065219350754178), ('a', -0.25761137977106813), ('i', -0.37919606548968343)]
What’s interesting here – “the” and “of” are on top, and words that are more personal are at the bottom. If you’re writing a lot about yourself or people, you don’t use as many definite articles. If you write about “the X of Y” a lot, you aren’t going to say much about “I, you, …” So the top on this list is book that are declaring facts:
A28654 0.1152 A plain and easy rule to rigge any ship by the length of his masts, and yards, w A66689 0.1107 To the Most Excellent Majesty of James the IId by the grace of God of England, S A18462 0.1027 The Imperial acheiuement of our dread sovereign King Charles together with ye ar B05531 0.0921 Proclamation anent the rendezvouses of the militia, for the year 1683. A92666 0.0907 A proclamation anent the rendezvouses of the militia, for the year 1683; Proclam B05530 0.0900 Proclamation anent the rendezvous of the militia, for the year, 1684. A93594 0.0886 A survey of the microcosme. Or the anatomy of the bodies of man and woman wherei A93593 0.0866 An exact survey of the microcosmus or little world being an anatomy, of the bodi A39982 0.0844 The form of the proceeding to the coronation of Their Majesty's, King James the B03350 0.0830 The form of the proceeding to the coronation of their Majesty's, King James the
And the bottom?
B01741 -0.0804 Tobia's advice, or, A remedy for a ranting young man. While you are single you t A75449 -0.0744 An Answer to unconstant William, or, The Young-man's resolution to pay the young B00386 -0.0733 A new ditty: of a lover, tossed hither and th[i]ther, that cannot speak his mind A15699 -0.0726 The honest vvooer his mind expressing in plain and few terms, by which to his mi B01734 -0.0713 Doubtful Robin; or, Constant Nanny. A new ballad. Tune of, Would you be a man of A06395 -0.0712 The lovers dream who sleeping, thought he did embrace his love, which when he wa B00382 -0.0712 The lovers dream: who sleeping, thought he did embrace his love, which when he w B02419 -0.0712 The country-mans care in choosing a wife: or, A young bachelor hard to be please B04151 -0.0710 The London lasses lamentation: or, Her fear she should never be married. To the B02416 -0.0708 The country lovers conquest. In winning a coy lass ..., To a pleasant new tune,
Notice that this is all based on the most common words – they will drown out all the uncommon words.
Usually in modeling (like LSA), we get rid of the common words. Here, we can use them to learn things about the corpus and to help organize it.
Of course, with a PCA like thing, the first thing someone wants is a scatterplot. With 50,000+ points, it’s a blotch. In time, we need some interactivity. You can identify the extreme points from the lists above.
Let’s try to understand why PCA puts “the” and “of” together. Here’s a similar plot of “the” on the X axis and “of” on the Y.
Not hugely correlated (it’s hard to see with the density). But compare that to a plot of “the” vs. “i”:
This is pretty striking, because there are many documents that don’t use “i” at all. And some that use I a lot. Yes, there is a document where 10% of the words are I!
A75449 0.1002 An Answer to unconstant William, ... B03187 0.0977 An excellent new song, called, ... B04151 0.0970 The London lasses lamentation: or, ... A73881 0.0968 A Sweet and pleasant sonet, entitled, ... B00386 0.0867 A new ditty: of a lover, tossed hither ... B06543 0.0852 Where Helen lies. To an excellent new ...
Or a document that is nearly 20% “of”!
A75823 0.1917 An Account of what captives has been ... A38857 0.1825 An exact account of the number of ... A92491 0.1809 Act for raising four months supply. ... A31306 0.1798 A Catalogue of the prelates and clergy ... A38907 0.1734 An Exact catalogue of the names of ...
Can these kinds of analysis of common words tell us anything about thelarge corpora?