TextDNA Instructions for Use



Data Format

TextDNA reads SQLite databases representing ordered sequences (genomes) of elements (genes) and optionally subsequences within the original sequences (chromosomes). For the purpose of comparison between elements of different sequences, matching sets of elements are identified by an integer label (the ortholog_group_id).

The basic schema for the database file is as follows (elements in italics are descriptions of the column data and NOT part of the schema):

TABLE genome

genome_id integer PRIMARY KEY Default ordering of the sequences. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
name text name of the sequence

TABLE chromosome

chromosome_id integer PRIMARY KEY Numeric identifier for a subsequence. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
name text name of the subsequence
genome_id integer genome_id of parent sequence
length integer number of elements in the subsequence

TABLE gene

gene_id integer PRIMARY KEY Numeric identifier for an element. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
name text name of the element
description text textual description of the element
chromosome_id integer chromosome_id of the parent subsequence
index_in_chromosome integer order of the element within the subsequence
start integer absolute starting position of the element within the subsequence
end integer absolute ending position of the element within the subsequence
strand integer 0 or 1
ortholog_group_id integer Numeric identifier for a set of matching elements. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
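
As a rough illustration, the schema above can be created with Python's standard sqlite3 module. The column names and types follow the tables above. The sample rows, the in-memory database, and the choice to quote "end" (an SQL keyword) are assumptions made for this sketch and are not confirmed by TextDNA itself.

```python
import sqlite3

# Column names and types follow the schema tables above.
# "end" is quoted because END is an SQL keyword.
SCHEMA = """
CREATE TABLE genome (
    genome_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE chromosome (
    chromosome_id INTEGER PRIMARY KEY,
    name TEXT,
    genome_id INTEGER,
    length INTEGER
);
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    name TEXT,
    description TEXT,
    chromosome_id INTEGER,
    index_in_chromosome INTEGER,
    start INTEGER,
    "end" INTEGER,
    strand INTEGER,
    ortholog_group_id INTEGER
);
"""


def create_database(path):
    """Create an empty database with the three tables and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


# Hypothetical sample data: one sequence, one subsequence, two elements.
conn = create_database(":memory:")
conn.execute("INSERT INTO genome VALUES (0, 'sequence A')")
conn.execute("INSERT INTO chromosome VALUES (0, 'sequence A, part 1', 0, 2)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "alpha", "first element", 0, 0, 0, 1, 1, 0),
        (1, "beta", "second element", 0, 1, 1, 2, 1, 1),
    ],
)
conn.commit()
```

To produce a file that can be opened from the "Load Dataset" tab, pass a path ending in .db, .sql, or .sqlite instead of ":memory:".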

The Google N-Grams data included with this software represents the 1,000 most popular words per decade from 1660 to 2000 according to Google (see http://books.google.com/ngrams/info for the original count data). In the case of the Google N-Grams data, column data is as follows:

TABLE genome

Describes the properties of a decade

genome_id integer PRIMARY KEY Default ordering of the decades
name text Name of the decade (e.g. 1660)

TABLE chromosome

A second object for defining decades
chromosome_id integer PRIMARY KEY Numeric identifier for the decade
name text Arbitrary name
genome_id integer Numeric identifier of the decade in the genome table
length integer The number of words in the decade (e.g. 1000)

TABLE gene

Describes information about an instance of a word
gene_id integer PRIMARY KEY Numeric identifier for a word
name text The word itself
description text Additional information about the word from the Google database
chromosome_id integer The chromosome_id of the parent decade
index_in_chromosome integer The popularity rank of the word within the parent decade
start integer The normalized inverse log of the number of instances of the word in the parent decade
end integer The normalized inverse log of one more than the number of instances of the word in the parent decade
strand integer 1
ortholog_group_id integer A numeric identifier shared by all instances of the word (e.g. all instances of 'the' map to 1)
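
The mapping from words to ortholog_group_id values can be sketched as follows. This is only an illustration of the idea that every instance of a word shares one identifier. The function name build_gene_rows, the reduced row layout, and the choice to number ranks and identifiers from zero are assumptions, not a description of how the included dataset was generated.

```python
def build_gene_rows(decades):
    """Assign one shared ortholog_group_id per distinct word.

    decades: list of (decade_name, words_in_rank_order) pairs.
    Returns (rows, group_ids), where each row is
    (gene_id, name, chromosome_id, index_in_chromosome, ortholog_group_id).
    """
    group_ids = {}  # word -> ortholog_group_id
    rows = []
    gene_id = 0
    for chromosome_id, (_decade, words) in enumerate(decades):
        for rank, word in enumerate(words):
            # The first time a word is seen it receives the next unused id.
            group_id = group_ids.setdefault(word, len(group_ids))
            rows.append((gene_id, word, chromosome_id, rank, group_id))
            gene_id += 1
    return rows, group_ids


# Hypothetical two-decade example.
rows, group_ids = build_gene_rows([
    ("1660", ["the", "and"]),
    ("1670", ["and", "the", "of"]),
])
```

Here both instances of "the" receive the same ortholog_group_id, which is what allows TextDNA to treat them as matching elements across decades.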


Additional data about the elements can be added to the gene table by adding a new column. The new column must have a numeric type (one of "int", "integer", "tinyint", "smallint", "mediumint", "bigint", "unsigned big int", "int2", "int8", "real", "double", "double precision", "float", "numeric", "decimal(10,5)", "boolean"). It can then be used as a mapping to color individual elements in the tool (see "Customizing the Display").
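
A minimal sketch of adding such a column with Python's sqlite3 module is shown below. The column name name_length and the cut-down gene table are hypothetical and serve only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down stand-in for the gene table; a real file has all schema columns.
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(0, "alpha"), (1, "beta")])

# Add a hypothetical numeric property and fill it in for every element.
conn.execute("ALTER TABLE gene ADD COLUMN name_length integer")
conn.execute("UPDATE gene SET name_length = length(name)")
conn.commit()

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(gene)")]
values = conn.execute("SELECT gene_id, name_length FROM gene ORDER BY gene_id").fetchall()
```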

Data can also be added through CSV files. The files must be of the format:

<set_name>, <element_name>, <position_in_set>, <count_value>

where each entry is located on a separate line. The number of elements loaded can be limited by setting an upper bound on the "rank" value (see "Customizing the Display"). Elements are matched across datasets by name, so every dataset must use the same element_name value for the same element.
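
The CSV layout can be generated with a short script such as the one below. This is a sketch under stated assumptions: the helper name rows_for_set is hypothetical, positions are numbered from 1 in descending order of count, and fields are written with plain commas (whether the loader requires a space after each comma is not stated here).

```python
import csv
import io
from collections import Counter


def rows_for_set(set_name, tokens, top_n):
    """Return (set_name, element_name, position_in_set, count_value) rows."""
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return [
        (set_name, word, position, count)
        for position, (word, count) in enumerate(ranked, start=1)
    ]


# Hypothetical input text for a set named "1990".
tokens = "the cat and the dog and the bird".split()

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows_for_set("1990", tokens, top_n=2))
csv_text = buffer.getvalue()
```

To write a file instead of a string, replace the StringIO buffer with open(path, "w", newline="").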

 

 


Display Overview

TextDNA
The default TextDNA display window.

TextDNA leverages multiple visual displays to support visual analysis. The above image is the default TextDNA screen that appears when first loading a dataset. The components are as follows:

Interacting with one element of the display can affect other display components. For more detailed information on interacting with the various components of the display, see the following sections of this guide.

 


Basic Use

To load a dataset into TextDNA, select the "Load Dataset" tab from the Navigation Pane. Click on the "Load" button to open the load file dialog box and navigate to the desired dataset on your computer (NOTE: files must have .db, .sql, or .sqlite extensions). Select the file and click "Open" to load the data into the tool.

The default display orders sequence rows in the Primary Display by their genome_id field (see "Data Format" for details). Rows can be rearranged by holding down the Alt key and clicking and dragging sequence rows vertically within the display. The names of each sequence are displayed in the Label Bar beside their respective sequence rows. Within each row, neighboring elements are grouped into blocks, collections of data elements aggregated spatially according to the Block Width parameter (see "Customizing the Display" for details). By default, each block is displayed as the average color of genes within the region represented by the block; however, alternative visual representations of the blocks are available.

Blocking

Elements within a particular spatial region are grouped into a single glyph, called a block. The width and representation of a block can be controlled by the user.
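
The idea of blocking can be sketched in a few lines. This is not TextDNA's implementation: the bucketing by integer division, the RGB tuples, and the function names are assumptions chosen to illustrate grouping elements by position and showing each block as the average color of its elements.

```python
from collections import defaultdict


def group_into_blocks(elements, block_width):
    """Bucket (position, rgb) elements into blocks of block_width positions."""
    blocks = defaultdict(list)
    for position, rgb in elements:
        blocks[position // block_width].append(rgb)
    return dict(blocks)


def average_color(colors):
    """Average a non-empty list of (r, g, b) tuples channel by channel."""
    n = len(colors)
    return tuple(sum(color[i] for color in colors) // n for i in range(3))


# Hypothetical elements: two fall in the first block, one in the second.
elements = [(0, (255, 0, 0)), (3, (0, 0, 255)), (12, (0, 255, 0))]
blocks = group_into_blocks(elements, block_width=10)
block_colors = {index: average_color(colors) for index, colors in blocks.items()}
```

A larger block width places more elements in each block, which gives a coarser overview of each sequence.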

Mousing over a block in the Primary Display will reveal a tooltip containing the names of elements within the block, followed by the subsequence name and sequence name on a separate line. If there are a large number of elements within a block, the list of names may be truncated, as indicated by an ellipsis. Mousing over a block will also highlight any blocks in the Primary Display containing at least one element matching an element in the current block, the locations of the elements in the Histogram display, and the corresponding elements in the Zoom Window, if locked. Clicking on a block in the Primary Display will cause lines to be drawn between all blocks containing at least one element matching an element in the clicked block. Links are only drawn between blocks sharing a matching element. Clicking on the block again will remove the lines. Right clicking on any block and selecting "Clear Links" will remove all lines from the Primary Display.

Detailed information about the contents of a block can be accessed through the Primary Display. Right clicking on a block and selecting "View Component Text" will display the names of elements within a block, separated by blank lines, in a pop-up window. Right clicking and selecting "Details" will open a pop-up containing a list of sequences in which elements matching the components of the clicked block can be found. Selecting a sequence name displays information about the matched element in that sequence, including information from the description field of the database. Elements are indicated by '*' and blocks are separated by '***'. The blocks from the genome initially selected are preceded by "Target:" in the genome window.

Elements contained within a block can also be explored in the Zoom Window. Mousing over a block in the Primary Display will cause the components of the block to be displayed in the Zoom Window. Elements are displayed on either side of a horizontal guide line on the basis of their strand field (see "Data Format" for details). The elements are represented as colored rectangles: their color and positioning correspond to their color and position values as defined by the parameters of the Primary Display. To interact with the components of a block, right click on the block in the Primary Display and select "Zoom to Block". The block is then locked into the Zoom Window and outlined in the Primary Display. If the component elements in the Zoom Window are dense, they may be aggregated into blocks, in which case a zoomed block can be further explored by again locking the block in the Zoom Window.

Once a block is locked, mousing over an element in the Zoom Window will highlight blocks in the Primary Display containing that element, sequences in the Label Bar where the element is found, and the location of the element in the Histogram distribution. Clicking on an element will draw a connecting link between blocks containing that element in the Primary Display. Clicking again will remove the link. Right click on any block in the Primary Display or Zoom Window and select "Unlock Zoomed Block" to release the block from the Zoom Window and return to mouseover zoom in the Primary Display.

Mousing over a bar in the Histogram display creates a tooltip listing the ortholog_group_id values of the matched sets of elements represented by the bar. It also highlights sequences containing those elements in the Label Bar and the corresponding blocks containing the elements in the Primary Display. Clicking and dragging the mouse in the Histogram display draws a red rectangle in the Histogram. Releasing the mouse filters the data in the Primary Display: blocks containing elements corresponding to the bars within the rectangle remain fully opaque, while all other blocks are made partially transparent. Pressing Escape removes this filter and clears the Histogram display.

 


Customizing the Display

The Navigation Pane contains a variety of menus for customizing the display. Selecting a tab in the Navigation Pane will open the corresponding menu set. This section will discuss these menus and how they can be used to support analysis in TextDNA. See the paper for details and examples of the types of analysis that these customizations can support.

Load Dataset

The Load Database Menu

The "Load Dataset" menu contains options for interacting with and changing the underlying data of the display.

 

CSV Dataset Configuration Window

The CSV dataset configuration window. Files for the dataset are located in the upper window. The upper boundary on the rank value is set by the Top N value.

Properties

The Properties menu

The "Properties" menu allows the user to manipulate the display parameters of the Primary Display. The "Update" button adjusts the parameters of the display to match those set in this menu.

Color Scale

The Color Scale menu

The "Color Scale" menu displays the currently active color mappings for the Primary Display.

The Color Scale menu displays the detailed information about the current color mapping properties of the display: the active ramp is displayed from minimum property value to maximum and the minimum and maximum values are displayed below the ramp. For split color encodings, two ramps are used, one for the first portion of the data (that defined by the "Color Scheme" ramp) and the other for the second (defined by the "Secondary Scheme" ramp).

Filter Menu

The Filter Menu

The "Filter Menu" allows the user to visually filter the data displayed in the Primary Display.

The Filter Menu allows the user to build conjunctive filters over the dataset. A filter is constructed from any combination of three fields: element names in the "Words" field; the number of sequences in which a set of matching elements can be found in the "Number of Decades" field; and the names of particular sequences in which an element must match at least one component element in the "Reference Decades" field. Multiple values can be entered into each field by separating them with commas. Clicking the "Filter" button reduces the opacity of any block in the Primary Display, and any element in the Zoom Window, that does not satisfy the filter criteria.

Clicking on "Remove Filter" will clear the current filter boxes and restore the opacity of all filtered blocks. The "Load Filter" button will open a load file dialog box. Opening a .filter file using this dialog will load the parameters of the filter into the "Filter Menu". The "Save Filter" button will open a save file dialog box to write out the filter parameters to a .filter file. All .filter files are of the format:

words: <comma-separated list of element names>
frequencies: <comma-separated list of frequencies (matching element set sizes)>
decades: <comma-separated list of sequence names>
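
A .filter file in this layout can be written and read back with a few lines of Python. The sketch below assumes that frequencies are whole numbers, that values are separated by commas without surrounding spaces when written, and that stray whitespace is ignored when read. The function names are hypothetical.

```python
def format_filter(words, frequencies, decades):
    """Return the text of a .filter file with the three fields described above."""
    lines = [
        "words: " + ",".join(words),
        "frequencies: " + ",".join(str(f) for f in frequencies),
        "decades: " + ",".join(decades),
    ]
    return "\n".join(lines) + "\n"


def parse_filter(text):
    """Parse .filter text into a dict of lists; unknown lines are ignored."""
    fields = {"words": [], "frequencies": [], "decades": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in fields:
            fields[key] = [item.strip() for item in value.split(",") if item.strip()]
    # Assumption: frequencies are integers (matching element set sizes).
    fields["frequencies"] = [int(f) for f in fields["frequencies"]]
    return fields


# Hypothetical filter: two words, two set sizes, two decades.
filter_text = format_filter(["the", "and"], [3, 5], ["1660", "1990"])
parsed = parse_filter(filter_text)
```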

Link Menu

The Link Menu

The "Link Menu" tracks a series of word pairs of interest across particular decades for later data processing.

 

The "Link Menu" can be used to build a .csv file of element pairs for later processing of the dataset. When a block is locked to the Zoom Window (see "Basic Use"), an element can be set as a link element. Linking two elements together opens a link dialog box where the user can select the series of decades over which the first element becomes linked to the second. When the user selects the "OK" button, the new link is then sent to the grid in the Link Menu.

Interface for adding element links.

The interface to add element links. By selecting a series of sequences on the left, the instances of the left-hand element in those sequences are linked to the right-hand element and sequence in the Link Menu.

 

Elements in the Link Menu grid can be edited by selecting an element and clicking "Edit". This will bring up the original link dialog box, where the user can select a different series of sequences over which to link the elements. Selecting a link in the grid and clicking "Remove" deletes the link. Selecting "Load Links" opens a load file dialog box, where the user can navigate to a .csv file of links and load them into the grid. Similarly, the "Save Links" button opens a save file dialog box where the user can save the current grid contents to a .csv file. The format of a link within a .csv file is: <first element name>, <first element sequence>, <second element name>, <second element sequence>. Each link is placed on its own line. The .csv file can then be used to identify word pairings of interest or in the data generation pipeline as a mechanism for refining later datasets.
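
For later processing outside the tool, a links file in this layout can be read and written with Python's csv module. This is a sketch only: the function names and the sample link are hypothetical, and the writer emits plain commas while the reader tolerates a space after each comma, since the exact spacing TextDNA uses is not stated here.

```python
import csv
import io


def write_links(links):
    """Serialize (first_name, first_sequence, second_name, second_sequence) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for link in links:
        writer.writerow(link)
    return buffer.getvalue()


def read_links(text):
    """Parse links text, one link per line, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [tuple(row) for row in reader if row]


# Hypothetical link between a word in one decade and a word in another.
links_text = write_links([("thee", "1660", "you", "1990")])
links = read_links("thee, 1660, you, 1990\n")
```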

 


 


 


 

Adobe AIR must be installed to run Sequence Surveyor and TextDNA.

Email dalbers@cs.wisc.edu for more information.