Tokenization

Before Doing Things with Texts

written by: Deidre Stuffer

VEP often breaks texts down into their component words. While breaking down a text is computationally fast, the speed belies the difficult decisions behind exactly how to break down strings of characters.

The process of segmenting strings of characters into meaningful parts is called tokenization. The resulting meaningful parts are referred to as tokens, and what constitutes a meaningful part must be specified computationally. A ‘word’ token, a computational representation of a word, can be thought of as a sequence of letters separated by whitespace. As you read this sentence, your eyes rely on whitespace to distinguish words from one another. Whitespace, however, is a simplification, given that the space between words is itself a character, just as punctuation marks are.

Based on extensive examination of TCP digital texts, VEP implements the following definition of a word token: a word token is a sequence of one or more letters, carets, or numbers. The sequence may contain hyphens, apostrophes, and asterisks as punctuation, but the punctuation must occur within the sequence and not at the beginning or end of the token.

Breaking down strings of characters into meaningful parts isn’t always a straightforward process. This document outlines what we identify as complications due to our definition of a word token, and how we handle them.

Issue 1: Punctuation

Defining the characters that comprise a word token requires specifying acceptable punctuation in addition to letters. VEP takes a conservative approach to punctuation. We allow only single occurrences of hyphens, apostrophes, and asterisks between letters, carets, or numbers. These punctuation marks are internal and contribute to the meaning of words. Hyphens are commonly found in compound adjectives (e.g., short-lived). Apostrophes can denote possession (e.g., girl’s) or elision (e.g., ne’er). Asterisks (*) are acceptable because they are the character that we insert into the text to substitute for ambiguous punctuation in TCP texts. As a result, you may see strings in VEP SimpleText files like “short*lived” and “broken*” that reflect where punctuation characters are unclear in TCP transcriptions. (We also substitute carets (^) for illegible characters in TCP transcriptions, e.g., sho^t-lived.)

Bounding punctuation, or punctuation that occurs at the beginning or end of a word (e.g., apostrophes), may or may not function as part of the word. Apostrophes pose a particular difficulty for text extraction from our TCP source files. Without the context of a sentence, it is unclear whether an apostrophe functions as a quotation mark or marks a plural possessive. As a result, we have decided to exclude all bounding punctuation from our word tokens. This decision reduces error in our text processing pipeline: spelling standardization will not erase punctuation in our SimpleText files, and word tokens can be correctly identified for standardization.

Implications

  • Bounding punctuation is not part of word tokens. Our TCP source files use apostrophes as opening and closing quotation marks. As written, our script cannot distinguish apostrophes used as quotation marks from possessive or elision apostrophes.
  • Punctuation is treated as a token separate from word tokens and number tokens. Bounding punctuation, like commas (,) and apostrophes (‘), is treated as its own token. Sequential punctuation, like the dash (rendered as — in VEP SimpleText) and ellipses (…), is grouped together as one token.

The following table contains examples of how text formatted with punctuation in our source files will be turned into word and punctuation tokens. Where a text yields more than one token, the tokens are separated by spaces.

TEXT            AVAILABLE TOKENS
Space.          space  .
Space!          space  !
Space?          space  ?
Preserve,       preserve  ,
18-mile-long    18-mile-long
Short-lived–    short-lived
‘Tis            tis
Parents’        parents
Parent’s        parent’s
‘The            the
Space*          space  *
Can*t           can*t
^overnment      ^overnment
be.Whether      be  .  whether
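
The behaviour in the table can be approximated with a short Python sketch. The word pattern follows the definition given above, with ASCII apostrophes assumed. The catch-all branch that turns every other non-space character into its own punctuation token, the lowercasing step, and the names `TOKEN_RE` and `tokenize` are assumptions made for illustration; they are not taken from VEP's pipeline, and the sketch does not reproduce how VEP handles bounding apostrophes.

```python
import re

# Word tokens: one or more letters, digits, or carets, optionally joined by
# single internal hyphens, apostrophes, or asterisks (ASCII apostrophe assumed).
WORD = r"[\w\^]+(?:[\-\'\*][\w\^]+)*"

# Assumption for illustration only: any other non-space character becomes
# its own single-character punctuation token.
TOKEN_RE = re.compile(WORD + r"|[^\w\s]")


def tokenize(text):
    """Return the lowercased tokens found in text, in order of appearance."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


for sample in ["Space.", "18-mile-long", "Can*t", "^overnment", "be.Whether"]:
    print(sample, "->", tokenize(sample))
```

Because the word branch is tried first, a hyphen, apostrophe, or asterisk is absorbed into a word only when characters follow it; otherwise it falls through to the punctuation branch.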

Issue 2: Numbers

VEP does not allow commas or periods to be considered acceptable characters within word strings. Numbers, however, are formatted with commas and periods. As a result, VEP defines the behavior of number tokens as follows.

To extract number tokens with the least amount of error, VEP defines a number token as a sequence of one or more digits. This sequence may contain commas and periods, but each comma or period must be a single instance surrounded by digits.

Implications

  • Number tokens consist purely of digits, apart from the internal commas and periods described above. If digits are contained in a sequence with letters, the sequence is automatically classified as a word token. So “18-mile-long” is considered a word token in the section above, whereas if the text read “18 miles long,” then the number string “18” would be its own number token. The string “1st” also would be treated as a word token, but “1” is a number token.

The following table contains examples of how numerical text in our source files will be turned into tokens. Where a text yields more than one token, the tokens are separated by spaces.

TEXT      AVAILABLE TOKENS LIST
1.        1  .
1,        1  ,
1,000     1,000
1.000     1.000
1,000.    1,000  .
1,000,    1,000  ,
1,000;    1,000  ;
1.1       1.1
1..1      1  .  .  1
1,1       1,1
1,,1      1  ,  ,  1
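
The distinction between number tokens and word tokens can be sketched as a small classifier. This is a minimal illustration of the definitions above, not VEP's code: the function name `classify`, the two pattern names, and the "punctuation" fallback label are hypothetical, and ASCII apostrophes are assumed.

```python
import re

# Number tokens: digits, with optional single commas or periods between digits.
NUMBER_RE = re.compile(r"[0-9]+(?:[,.][0-9]+)*")
# Word tokens, as defined under Issue 1 (ASCII apostrophe assumed).
WORD_RE = re.compile(r"[\w\^]+(?:[\-\'\*][\w\^]+)*")


def classify(token):
    """Label a single, already-separated token."""
    if NUMBER_RE.fullmatch(token):
        return "number"
    if WORD_RE.fullmatch(token):
        return "word"
    return "punctuation"


for token in ["18", "1,000", "1.1", "1st", "18-mile-long", ","]:
    print(repr(token), classify(token))
```

The number test runs first because every string of plain digits also satisfies the word pattern; a string such as "1st" fails the number test and is therefore labeled a word.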

Issue 3: Capitalization

Capitalization in our SimpleText files does not always reproduce capitalization from the source files.

Rules:

  • if a word consists entirely of capital letters in the source file, standardization capitalizes all letters in the output (e.g., STORY > STORY)
  • if a word consists entirely of lowercase letters in the source file, standardization reproduces all lowercase letters (e.g., story > story)
  • if a word is neither all uppercase nor all lowercase, standardization defaults to title capitalization (e.g., Story > Story; HisTory > History)
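
The three rules above can be sketched as a single function. The name `match_case` and its two-argument shape (the word as it appears in the source, and its standardized spelling) are hypothetical, and "title capitalization" is assumed to mean an initial capital followed by lowercase letters.

```python
def match_case(source_word, standardized):
    """Apply the source word's capitalization pattern to the standardized form."""
    if source_word.isupper():
        # Rule 1: all capitals in the source, all capitals in the output.
        return standardized.upper()
    if source_word.islower():
        # Rule 2: all lowercase in the source, all lowercase in the output.
        return standardized.lower()
    # Rule 3: mixed case defaults to an initial capital only.
    return standardized.capitalize()


for source in ["STORY", "story", "Story", "HisTory"]:
    print(source, ">", match_case(source, source.lower()))
```

Under this sketch, mixed-case forms such as "HisTory" lose their internal capitals, which is why capitalization in the output does not always reproduce the source.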

Putting It All Together

Below are an example sentence and a list of the word, number, and punctuation tokens VEP generates from it to further analyse the TCP texts and generate statistics.

The example sentence is taken from TCP file A02945, THE PRINCIPAL NAVIGATIONS, VOYAGES, TRAFFIQVES AND DISCOVEries of the English Nation (1599). (Traffiques is such a good word.)

And in the same year thy Majesty’s diak Boris Gregoriwich had for thy use 15. broad’cloths of diverse sorts, prised at 210. robles, whereof 90. robles are unpaid.

Sentence Tokens List
and
in
the
same
year
thy
majesty’s
diak
boris
gregoriwich
had
for
thy
use
15
.
broad’cloths
of
diverse
sorts
,
prised
at
210
.
robles
,
whereof
90
.
robles
are
unpaid
.

The list above contains 34 tokens total: 3 number tokens, 6 punctuation tokens, and 25 word tokens.
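
The counts above can be checked with a short sketch. It assumes ASCII apostrophes in the sentence, lowercases every token, and adds a hypothetical catch-all branch so that each remaining non-space character becomes a punctuation token; the helper name `kind` and its classification rules are likewise illustrative rather than VEP's own.

```python
import re
from collections import Counter

# Number and word branches (ASCII apostrophe assumed), plus a hypothetical
# catch-all branch that emits every other non-space character as a token.
TOKEN_RE = re.compile(
    r"[0-9]+(?:[\,\.][0-9]+)+|[\w\^]+(?:[\-\'\*][\w\^]+)*|[^\w\s]"
)


def kind(token):
    """Classify a token by the characters it contains."""
    if token[0].isdigit() and all(c.isdigit() or c in ",." for c in token):
        return "number"
    if token[0].isalnum() or token[0] in "^_":
        return "word"
    return "punctuation"


sentence = (
    "And in the same year thy Majesty's diak Boris Gregoriwich had for thy "
    "use 15. broad'cloths of diverse sorts, prised at 210. robles, whereof "
    "90. robles are unpaid."
)

tokens = [token.lower() for token in TOKEN_RE.findall(sentence)]
counts = Counter(kind(token) for token in tokens)

print(len(tokens))  # 34
print(counts["word"], counts["number"], counts["punctuation"])  # 25 3 6
```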

Tokenization Through Python Regular Expression

VEP tokenizes texts through a Python regular expression, a computational notation that tells a computer what patterns of characters to look for. Erin Winter wrote the regular expression that VEP uses to partition our TCP source texts into tokens, following the description of tokens above. We provide the regular expression as documentation for others interested in parsing the texts available as part of the XML TCP files.

VEP Python Regular Expression for parsing ASCII texts
r'[0-9]+(?:[\,\.][0-9]+)+|[\w\^]+(?:[\-\'\*][\w\^]+)*'

Parsing Numbers

The first part of the regular expression parses numbers.

r'[0-9]+(?:[\,\.][0-9]+)+'

Main Parts of the Regular Expression

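[0-9]+              0-9 means any digit from 0 to 9 inclusive
                    + means 1 or more repetitions
                    The first set indicates that a number token contains one or more digits.
(?:[\,\.][0-9]+)    This sequence defines how and where punctuation can occur within numbers.
                    In the full expression above, this group is followed by +, meaning the sequence (punctuation and digits) is repeated 1 or more times.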

The set [\,\.] indicates that only commas (,) and periods (.) can be contained within numbers. Directly following the punctuation set is the digit set from the beginning ([0-9]+). As a result, the defined punctuation must be contained within the number and cannot be the first or last character of the sequence.
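
A quick way to see these constraints is to test whole strings against the number branch as it appears in the full expression. The helper name `is_punctuated_number` and the sample strings are made up for illustration.

```python
import re

# Number branch as it appears in the full VEP expression.
NUMBER_BRANCH = re.compile(r"[0-9]+(?:[\,\.][0-9]+)+")


def is_punctuated_number(text):
    """True when the whole string matches the number branch."""
    return NUMBER_BRANCH.fullmatch(text) is not None


for sample in ["1,000", "1.1", "1,000,000", "1,000.", "1..1", ",5", "15"]:
    print(sample, is_punctuated_number(sample))
```

The first three samples match; the others do not. Trailing, leading, and doubled punctuation are rejected because every comma or period must be followed by at least one digit. The plain "15" is rejected only because this branch requires at least one punctuation group; in the full expression such strings are picked up by the word branch, since \w also matches digits.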

Parsing Words

The second half of the regular expression parses words.

r'[\w\^]+(?:[\-\'\*][\w\^]+)*'

Main Parts of the Regular Expression

[\w\^]+    \w is shorthand for any letter A-Z, any letter a-z, and any digit 0-9 (it also matches the underscore, and in Python 3 it matches non-ASCII letters and digits too unless the re.ASCII flag is set)
           \^ means the caret character (^)
           + means 1 or more repetitions

The set [\w\^]+ indicates that a word is a sequence of one or more of the following: uppercase letters, lowercase letters, numbers, or carets. VEP replaces unknown characters in TCP collections with carets.

[\-\'\*]    \- means hyphen
            \' means apostrophe
            \* means asterisk
[\w\^]      \w is shorthand for any letter A-Z, any letter a-z, and any digit 0-9
            \^ means the caret character (^)

The portion (?:[\-\’\*][\w\^]+) of the regular expression defines what punctuation can occur within words and where.

The set [\-\'\*] indicates that only hyphens (-), apostrophes ('), and asterisks (*) can be contained within words. (VEP replaces ambiguous punctuation notation in TCP collection texts with asterisks.)

Directly following the permitted punctuation set is the same set beginning the word regular expression ([\w\^]). As a result, the defined punctuation must be within a word and not be the first or last character of a word.

The asterisk * following the parenthesis means that the sequence (punctuation and character) can be repeated 0 or more times.
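
The word branch can be exercised on its own to confirm where punctuation is allowed. ASCII apostrophes are assumed; the helper name `is_word_token` and the sample strings are illustrative.

```python
import re

# Word branch of the VEP expression (ASCII apostrophe assumed).
WORD_BRANCH = re.compile(r"[\w\^]+(?:[\-\'\*][\w\^]+)*")


def is_word_token(text):
    """True when the whole string matches the word branch."""
    return WORD_BRANCH.fullmatch(text) is not None


accepted = ["short-lived", "ne'er", "sho^t-lived", "short*lived"]
rejected = ["short--lived", "-short", "broken*"]

for sample in accepted + rejected:
    print(sample, is_word_token(sample))
```

The rejected strings fail for the reasons given above: the punctuation is doubled, leads the string, or trails it. When the expression is used to search running text rather than to test a whole string, such strings are split around the offending punctuation instead of being discarded.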

Resources

For an interactive introduction to regular expressions, visit here: http://regexone.com/lesson/introduction_abcs
Simple introduction to Python Regular Expressions: http://regexone.com/references/python