TextDNA: Raw Text Manipulation
Functions Overview
TextDNA performs several statistical manipulations on text input. It can be difficult to see how these manipulations apply to raw text as opposed to n-gram input. This page walks through a small example showing how nine lines of text are processed by the functions below. Comparing each function's output with the original input can help you understand how TextDNA arranges text input.
- Word Frequency: The number of times a word appears within the entire dataset, calculated by the program.
- Sequence Frequency: The number of sequences a word appears in.
- Sequence Co-Occurrence: The co-occurrence of words. Words are arranged first by the number of sequences they appear in, then by the set of sequences they appear in, and finally in alphabetical order. This function generates a unique column for each word.
Text Sequences
Pretend the following nine lines each represent a sequence in TextDNA. These lines are taken from a scene in Hamlet.
- Ay, my lord.
- It might, my lord.
- Not a jot more, my lord.
- Ay, my lord, and of calf-skins too.
- What's that, my lord?
- E'en so.
- E'en so, my lord.
- 'Twere to consider too curiously, to consider so.
- That is Laertes, a very noble lord: mark.
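Before any counting can happen, each line has to be split into words. The sketch below shows one way to do that. It is an assumption for illustration, not TextDNA's actual tokenizer: the `tokenize` helper is hypothetical, and it lowercases everything (so "Laertes" becomes "laertes") while keeping apostrophes and hyphens so that "e'en", "'twere", and "calf-skins" survive as single words.

```python
import re

# The nine example lines, one per sequence.
SEQUENCES = [
    "Ay, my lord.",
    "It might, my lord.",
    "Not a jot more, my lord.",
    "Ay, my lord, and of calf-skins too.",
    "What's that, my lord?",
    "E'en so.",
    "E'en so, my lord.",
    "'Twere to consider too curiously, to consider so.",
    "That is Laertes, a very noble lord: mark.",
]


def tokenize(line):
    """Hypothetical tokenizer: lowercase the line, then keep runs of
    letters, apostrophes, and hyphens as words."""
    return re.findall(r"[a-z'-]+", line.lower())


tokens = [tokenize(line) for line in SEQUENCES]
for number, words in enumerate(tokens, start=1):
    print(number, words)
```

With this rule, sequence 8 keeps both of its repeated words ("to" and "consider" each appear twice), which is what makes the word counts and sequence counts differ later on.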
Word Frequency
- lord (7), my (6), ay (2)
- lord (7), my (6), it (1), might (1)
- lord (7), my (6), a (2), jot (1), more (1), not (1)
- lord (7), my (6), ay (2), too (2), and (1), calf-skins (1), of (1)
- lord (7), my (6), that (2), what's (1)
- so (3), e'en (2)
- lord (7), my (6), so (3), e'en (2)
- so (3), consider (2), to (2), too (2), curiously (1), 'twere (1)
- lord (7), a (2), that (2), is (1), Laertes (1), mark (1), noble (1), very (1)
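The list above can be reproduced with a short script. This is a minimal sketch under stated assumptions, not TextDNA's implementation: `tokenize` and `describe` are hypothetical helpers, words are lowercased, and ties are broken alphabetically while ignoring a leading apostrophe so that "'twere" sorts under "t".

```python
import re
from collections import Counter

SEQUENCES = [
    "Ay, my lord.",
    "It might, my lord.",
    "Not a jot more, my lord.",
    "Ay, my lord, and of calf-skins too.",
    "What's that, my lord?",
    "E'en so.",
    "E'en so, my lord.",
    "'Twere to consider too curiously, to consider so.",
    "That is Laertes, a very noble lord: mark.",
]


def tokenize(line):
    """Hypothetical tokenizer: lowercase, keep letters, apostrophes, hyphens."""
    return re.findall(r"[a-z'-]+", line.lower())


tokens = [tokenize(line) for line in SEQUENCES]

# Word frequency: every occurrence counts, across the entire dataset.
word_freq = Counter(word for words in tokens for word in words)


def describe(words, counts):
    """List a sequence's unique words, highest count first, then alphabetically."""
    unique = sorted(set(words), key=lambda w: (-counts[w], w.lstrip("'")))
    return ", ".join(f"{w} ({counts[w]})" for w in unique)


for words in tokens:
    print("-", describe(words, word_freq))
```

The first line printed is `- lord (7), my (6), ay (2)`. The only visible difference from the list above is capitalization, because this sketch lowercases "Laertes".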
Sequence Frequency
- lord (7), my (6), ay (2)
- lord (7), my (6), it (1), might (1)
- lord (7), my (6), a (2), jot (1), more (1), not (1)
- lord (7), my (6), ay (2), too (2), and (1), calf-skins (1), of (1)
- lord (7), my (6), that (2), what's (1)
- so (3), e'en (2)
- lord (7), my (6), so (3), e'en (2)
- so (3), too (2), consider (1), curiously (1), to (1), 'twere (1)
- lord (7), a (2), that (2), is (1), Laertes (1), mark (1), noble (1), very (1)
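Sequence frequency differs from word frequency only when a word repeats within a single sequence. The sketch below counts each word at most once per sequence and then reports the words whose two counts disagree. As before, `tokenize` is a hypothetical helper and the lowercasing is an assumption, not a documented TextDNA behavior.

```python
import re
from collections import Counter

SEQUENCES = [
    "Ay, my lord.",
    "It might, my lord.",
    "Not a jot more, my lord.",
    "Ay, my lord, and of calf-skins too.",
    "What's that, my lord?",
    "E'en so.",
    "E'en so, my lord.",
    "'Twere to consider too curiously, to consider so.",
    "That is Laertes, a very noble lord: mark.",
]


def tokenize(line):
    """Hypothetical tokenizer: lowercase, keep letters, apostrophes, hyphens."""
    return re.findall(r"[a-z'-]+", line.lower())


tokens = [tokenize(line) for line in SEQUENCES]

# Word frequency counts every occurrence.
word_freq = Counter(word for words in tokens for word in words)

# Sequence frequency counts a word once per sequence, so each
# sequence is reduced to its set of unique words first.
seq_freq = Counter(word for words in tokens for word in set(words))

# Words whose counts differ: (word frequency, sequence frequency).
differing = {
    word: (word_freq[word], seq_freq[word])
    for word in word_freq
    if word_freq[word] != seq_freq[word]
}
print(differing)
```

In this example only "to" and "consider" differ, each dropping from 2 to 1, because both repeat within sequence 8 and appear nowhere else.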
Sequence Co-Occurrence
Each column header is the number of sequences a word appears in. The numbers in parentheses list the sequences that contain the word.
Sequence | 7 | 6 | 5 | 4 | 3 | 2 | 1
seq. 1 | lord (1,2,3,4,5,7,9) | my (1,2,3,4,5,7) | | | | ay (1,4) |
seq. 2 | lord (1,2,3,4,5,7,9) | my (1,2,3,4,5,7) | | | | | it (2), might (2)
seq. 3 | lord (1,2,3,4,5,7,9) | my (1,2,3,4,5,7) | | | | a (3,9) | jot (3), more (3), not (3)
seq. 4 | lord (1,2,3,4,5,7,9) | my (1,2,3,4,5,7) | | | | ay (1,4), too (4,8) | and (4), calf-skins (4), of (4)
seq. 5 | lord (1,2,3,4,5,7,9) | my (1,2,3,4,5,7) | | | | that (5,9) | what's (5)
seq. 6 | | | | | so (6,7,8) | e'en (6,7) |
seq. 7 | lord (1,2,3,4,5,7,9) | my (1,2,3,4,5,7) | | | so (6,7,8) | e'en (6,7) |
seq. 8 | | | | | so (6,7,8) | too (4,8) | consider (8), curiously (8), to (8), 'twere (8)
seq. 9 | lord (1,2,3,4,5,7,9) | | | | | a (3,9), that (5,9) | is (9), Laertes (9), mark (9), noble (9), very (9)
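The co-occurrence layout can be sketched by recording, for every word, the set of sequences that contain it, and then placing each word in the column matching the size of that set. This is an illustrative sketch, not TextDNA's code: `tokenize`, `membership`, and `row` are hypothetical names, and each cell is printed as the word followed by the sequence numbers it appears in.

```python
import re
from collections import defaultdict

SEQUENCES = [
    "Ay, my lord.",
    "It might, my lord.",
    "Not a jot more, my lord.",
    "Ay, my lord, and of calf-skins too.",
    "What's that, my lord?",
    "E'en so.",
    "E'en so, my lord.",
    "'Twere to consider too curiously, to consider so.",
    "That is Laertes, a very noble lord: mark.",
]


def tokenize(line):
    """Hypothetical tokenizer: lowercase, keep letters, apostrophes, hyphens."""
    return re.findall(r"[a-z'-]+", line.lower())


tokens = [tokenize(line) for line in SEQUENCES]

# For each word, the ordered list of sequence numbers (1-based) containing it.
membership = defaultdict(list)
for number, words in enumerate(tokens, start=1):
    for word in set(words):
        membership[word].append(number)


def row(words):
    """Map column (number of sequences) -> cells of the form 'word (sequences)'."""
    columns = defaultdict(list)
    for word in sorted(set(words), key=lambda w: w.lstrip("'")):
        seqs = membership[word]
        columns[len(seqs)].append(f"{word} ({','.join(map(str, seqs))})")
    return dict(columns)


for number, words in enumerate(tokens, start=1):
    print(f"seq. {number}", row(words))
```

Two words with the same sequence set, such as "so" in sequences 6, 7, and 8, produce identical cells in every row they appear in, which is what lets matching words line up visually across sequences.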
Email danielle.szafir@colorado.edu for more information.