

News

October 10, 2016: TextDNA is now available to download for personal use! You can make your own datasets! Learn more at the GitHub page.

The Basics

TextDNA allows users to explore and analyze word usage across text collections of varying scale. With TextDNA, users can compare word usage between document collections (e.g., across different decades), between individual documents, or between elements within a document (e.g., chapters or acts). Word usage can be explored across raw texts, i.e., text documents not subject to processing. Additionally, word usage can be explored across different metrics, such as how frequently words are used within a document.

TextDNA is based on the Sequence Surveyor genomics analysis system, which provides overview visualizations to elucidate large-scale patterns across multiple genome sequence alignments. Like genomes, texts can be thought of as distinct sequences of data. Just as bacterial strains can be distinguished by their DNA sequences, texts can be distinguished by their sequences of words. TextDNA visualizes sequences of text in parallel, allowing users to detect word usage patterns.

TextDNA displays information about word usage in text sequences by combining position and color. Each unique set (or sequence) of words is mapped to a row. Each word of the set, in turn, is mapped to a colored block within its row. Because position and color are encoded separately, the words within a row can be reordered and recolored according to different data properties. These variable encodings let users examine their data from multiple angles and scrutinize global trends and outliers.
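The row-and-block encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TextDNA's implementation: the function name `build_row`, the block fields, and the choice of first-appearance order as "rank" are all hypothetical. It shows how one property can drive a block's position while another drives its color.

```python
from collections import Counter

def build_row(words, order_by="rank", color_by="frequency"):
    """Turn one word sequence into a row of blocks (hypothetical layout).

    Each block carries two data properties: "rank" (assumed here to be
    the order of first appearance) and "frequency" (occurrence count).
    One property sets block order, the other sets a color value in [0, 1].
    """
    counts = Counter(words)
    unique = list(dict.fromkeys(words))  # unique words, first-appearance order
    blocks = [
        {"word": w, "rank": i + 1, "frequency": counts[w]}
        for i, w in enumerate(unique)
    ]
    if not blocks:
        return []

    # Position: low rank first, or high frequency first.
    blocks.sort(key=lambda b: b[order_by], reverse=(order_by == "frequency"))

    # Color: min-max normalize the chosen property within the row.
    values = [b[color_by] for b in blocks]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values match
    for b in blocks:
        b["color"] = (b[color_by] - lo) / span
    return blocks

row = build_row(["a", "b", "b"], order_by="frequency", color_by="rank")
print([(b["word"], b["color"]) for b in row])
```

Swapping the `order_by` and `color_by` arguments reorders and recolors the same row without touching the underlying data, which is the idea behind the variable encodings.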

Information about the previous version of TextDNA, including download information, can be found here. That version is offered as is.

Specifications: TextDNA uses the GPU to provide interactivity in the web browser. As a result, the program is resource intensive. It works best in Google Chrome. TextDNA also works in Mozilla Firefox and Internet Explorer 11, and has limited functionality in Safari.

Tutorial

Feeling overwhelmed? Try the Simple Test Dataset, which comes with a downloadable tutorial and a Simple Test Dataset data summary.

Datasets

TextDNA includes several sample datasets. If you would like to see another dataset, contact us to work out the details.

Shakespeare (Raw Text) provides the raw text for 36 of Shakespeare's plays with the stopwords removed. The "rank" of a word corresponds to its position within the text.
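As a rough illustration of what a raw-text sequence looks like, the sketch below tokenizes a passage, drops stopwords, and assigns each remaining word its position as its "rank". The stopword list, the tokenizer, and the decision to count positions after stopword removal are assumptions made for this example; the dataset's actual preprocessing may differ.

```python
import re

# Illustrative stopword list only; the list used for the dataset is not given.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def raw_text_sequence(text, stopwords=STOPWORDS):
    """Return (position, word) pairs for a text with stopwords removed.

    Positions are counted after stopword removal (an assumption), starting
    at 1, so the "rank" of a word is simply where it falls in the text.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stopwords]
    return list(enumerate(kept, start=1))

print(raw_text_sequence("To be, or not to be"))
```

Unlike the frequency-based datasets, a word that recurs in the text appears more than once in this kind of sequence, once per occurrence.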

Top 200 Google N-Grams (1660-2009) aggregates the one-grams from the Google Books Dataset by decade between 1660 and 2009. The top 200 words in each decade are visualized as a sequence, with decades ordered chronologically, and the "rank" of a word corresponds to its frequency in the dataset for that decade.

Top 1000 Google N-Grams (1660-2009) aggregates the one-grams from the Google Books Dataset by decade between 1660 and 2009. The top 1000 words in each decade are visualized as a sequence, with decades ordered chronologically, and the "rank" of a word corresponds to its frequency in the dataset for that decade.

Top 5000 Google N-Grams (1660-2009) aggregates the one-grams from the Google Books Dataset by decade between 1660 and 2009. The top 5000 words in each decade are visualized as a sequence, with decades ordered chronologically, and the "rank" of a word corresponds to its frequency in the dataset for that decade.
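The three Google N-Grams datasets above differ only in how many words are kept per decade. A minimal sketch of that kind of aggregation follows, assuming the input is a stream of `(word, year, count)` records; the function name and record layout are hypothetical and do not reflect the Google Books file format or the scripts used to build these datasets.

```python
from collections import Counter, defaultdict

def top_words_by_decade(records, top_n=200):
    """Aggregate one-gram counts by decade and keep the top_n words.

    records: iterable of (word, year, count) tuples (assumed layout).
    Returns {decade: [(rank, word, total_count), ...]} with decades in
    chronological order and rank 1 given to the most frequent word.
    """
    totals = defaultdict(Counter)
    for word, year, count in records:
        decade = (year // 10) * 10  # e.g. 1667 -> 1660
        totals[decade][word] += count

    sequences = {}
    for decade in sorted(totals):
        ranked = totals[decade].most_common(top_n)
        sequences[decade] = [
            (rank, word, total)
            for rank, (word, total) in enumerate(ranked, start=1)
        ]
    return sequences

sample = [("the", 1661, 5), ("of", 1665, 3), ("the", 1669, 2),
          ("and", 1670, 9), ("of", 1672, 4)]
print(top_words_by_decade(sample, top_n=1))
```

Calling the function with `top_n=200`, `top_n=1000`, or `top_n=5000` would yield sequences of the sizes the three datasets describe.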

TCP Top 1000 N-Grams (1470-1830) aggregates the top 1,000 one-grams from all TCP texts arranged by decade.

Core Drama 1660 N-Gram by Genre aggregates one-grams from Jonathan Hope's Core Drama 1660 corpus.

Early Modern Playwrights (Top 5000 N-Grams) aggregates one-grams from the collected plays of selected playwrights in Jonathan Hope's Early Modern Drama 1554 corpus. For this dataset, playwrights were selected based on the criterion that they had 10 or more play files available. The top 5000 words used by these playwrights in the corpus are visualized as a sequence, with the "rank" of a word corresponding to the word frequency in the dataset for that playwright.
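The selection rule for playwrights can be expressed as a simple filter. The sketch below assumes the corpus is available as `(playwright, filename)` pairs, which is a hypothetical layout chosen for illustration; the names and file naming are not taken from the corpus itself.

```python
from collections import Counter

def select_playwrights(play_files, min_plays=10):
    """Return playwrights with at least min_plays play files, sorted by name.

    play_files: iterable of (playwright, filename) pairs (assumed layout).
    """
    counts = Counter(playwright for playwright, _ in play_files)
    return sorted(name for name, n in counts.items() if n >= min_plays)

# Made-up example: author "A" has 10 files, author "B" has 9.
files = [("A", f"a{i}.txt") for i in range(10)] + \
        [("B", f"b{i}.txt") for i in range(9)]
print(select_playwrights(files))
```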

Early Modern Dramas (By Decade) aggregates one-grams of plays sorted by decade in Jonathan Hope's Early Modern Drama 1554 corpus. For this dataset, playwrights were selected based on the criterion that they had 10 or more play files available. The top 5000 words used in plays per decade are visualized as a sequence, with the "rank" of a word corresponding to the word frequency in the dataset for that decade.



 



Email danielle.szafir@colorado.edu for more information.