Sequence Surveyor Instructions for Use



Data Format

Sequence Surveyor reads SQLite databases representing a set of genomes aligned at the gene-level and optionally composed of multiple chromosomes. For the purpose of comparison between genes within different genomes, matching sets of genes (orthologs) are identified by an integer label (the ortholog_group_id).

The basic schema for the database file is as follows (elements in italics are descriptions of the column data and NOT part of the schema):

TABLE genome

genome_id integer PRIMARY KEY Default ordering of the sequences. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
name text name of the genome

TABLE chromosome

chromosome_id integer PRIMARY KEY Numeric identifier for a chromosome. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
name text name of the chromosome
genome_id integer genome_id of parent genome
length integer number of genes in the chromosome

TABLE gene

gene_id integer PRIMARY KEY Numeric identifier for a gene. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
name text name of the gene
description text textual description of the gene
chromosome_id integer chromosome_id of the parent chromosome
index_in_chromosome integer order of the gene within the chromosome
start integer genomic coordinate where the gene begins
end integer genomic coordinate where the gene ends
strand integer 0 (backward strand) or 1 (forward strand)
ortholog_group_id integer Numeric identifier for a set of orthologs. Should start with zero and continue in counting order (i.e. 0, 1, 2, ...).
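As a rough illustration, the schema above could be created with Python's built-in sqlite3 module as sketched below. The column names and types come from the tables above; the helper name create_schema, the sample rows, and the zero-based index_in_chromosome values are assumptions made for the example. Note that "end" is an SQL keyword, so it is quoted in the CREATE TABLE statement.

```python
import sqlite3

# Column names and types follow the schema listed above.
# "end" is an SQL keyword, so it is quoted to be used as a column name.
SCHEMA = """
CREATE TABLE genome (
    genome_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE chromosome (
    chromosome_id INTEGER PRIMARY KEY,
    name TEXT,
    genome_id INTEGER,
    length INTEGER
);
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    name TEXT,
    description TEXT,
    chromosome_id INTEGER,
    index_in_chromosome INTEGER,
    start INTEGER,
    "end" INTEGER,
    strand INTEGER,
    ortholog_group_id INTEGER
);
"""


def create_schema(conn):
    """Create the three required tables (hypothetical helper)."""
    conn.executescript(SCHEMA)


# An in-memory database keeps the sketch self-contained; a real dataset
# would be written to a file ending in .db, .sql, or .sqlite.
conn = sqlite3.connect(":memory:")
create_schema(conn)

# Sample rows are invented for illustration only.
conn.execute("INSERT INTO genome VALUES (0, 'genome_A')")
conn.execute("INSERT INTO chromosome VALUES (0, 'chr1', 0, 2)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "geneA", "example gene", 0, 0, 100, 400, 1, 0),
        (1, "geneB", "example gene", 0, 1, 500, 900, 0, 1),
    ],
)
conn.commit()
```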

 

Additional data about the genes can be added to the gene table by adding a new column. Any additional column must be of a numeric type (one of "int", "integer", "tinyint", "smallint", "mediumint", "bigint", "unsigned big int", "int2", "int8", "real", "double", "double precision", "float", "numeric", "decimal(10,5)", "boolean") and can then be used as a mapping to color individual genes in the tool (see "Customizing the Display").
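For instance, a numeric column might be added with a standard SQLite ALTER TABLE statement, as in the sketch below. The column name gc_content and its values are hypothetical, and the gene table is cut down to two columns to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced gene table for illustration; a real dataset has the full schema.
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(0, "geneA"), (1, "geneB")])

# gc_content is a hypothetical property; "real" is one of the numeric
# types listed above.
conn.execute("ALTER TABLE gene ADD COLUMN gc_content REAL")
conn.executemany(
    "UPDATE gene SET gc_content = ? WHERE gene_id = ?",
    [(0.41, 0), (0.57, 1)],
)
conn.commit()

# Inspect the declared column types: (column name, declared type).
column_types = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(gene)")]
```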

This data format extends to other sequential data problems in which ordered sequences are compared and elements in different sequences are related, such as standard (base-level) alignments, aligned protein sequences, or even something as different as rank data (see TextDNA). Sequences map to genomes, subsequences to chromosomes, and the elements being compared to genes. Groups of matching elements share the same ortholog_group_id value. Any valid Sequence Surveyor dataset must, at a minimum, contain the fields above.
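The mapping described above might look like the following sketch for rank data. Everything here is an assumption for illustration: the function name rankings_to_rows, the choice of one chromosome per sequence, the use of the rank as the start coordinate, and the constant strand value. The only rule taken from the text is that matching elements share an ortholog_group_id.

```python
def rankings_to_rows(rankings):
    """Turn {sequence name: ordered items} into genome, chromosome, gene rows.

    Hypothetical helper: each ranked list becomes one genome with a single
    chromosome, and each distinct item gets one ortholog_group_id.
    """
    group_ids = {}
    genomes, chromosomes, genes = [], [], []
    for genome_id, (seq_name, items) in enumerate(rankings.items()):
        genomes.append((genome_id, seq_name))
        # One chromosome per genome, so the chromosome_id equals the genome_id.
        chromosomes.append((genome_id, seq_name, genome_id, len(items)))
        for index, item in enumerate(items):
            # The first time an item is seen it gets the next unused group id.
            group_id = group_ids.setdefault(item, len(group_ids))
            genes.append(
                # gene_id, name, description, chromosome_id, index_in_chromosome,
                # start, end, strand, ortholog_group_id
                (len(genes), item, "", genome_id, index, index, index + 1, 1, group_id)
            )
    return genomes, chromosomes, genes


genomes, chromosomes, genes = rankings_to_rows(
    {"2019": ["a", "b", "c"], "2020": ["b", "a", "d"]}
)
```

The resulting tuples are in the column order of the genome, chromosome, and gene tables, so they could be inserted with ordinary parameterized INSERT statements.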

 


Display Overview

Sequence Surveyor
The default Sequence Surveyor display window.

Sequence Surveyor leverages multiple visual displays to support visual analysis. The above image is the default Sequence Surveyor screen that appears when first loading a dataset. The components are as follows:

Interacting with different elements of the display can affect other display components. For more detailed information on interaction with the various components of the display, see the ensuing sections of this guide.

 


Basic Use

To load a dataset into Sequence Surveyor, select the "Load Dataset" tab from the Navigation Pane. Click on the "Load" button to open the load file dialog box and navigate to the desired dataset on your computer (NOTE: files must have .db, .sql, or .sqlite extensions). Select the file and click "Open" to load the data into the tool. The phylogenetic tree is loaded automatically if the dendrogram's .tree file has the same filename as the dataset and is in the same location (see "Customizing the Display" for more information).
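The naming rules above can be checked before opening the tool, for example with the sketch below. The helper name companion_tree_path is hypothetical. The sketch also assumes that "same filename" means the dataset extension is replaced by .tree (for example yeast.sqlite and yeast.tree) and that the extension check is case-insensitive; neither detail is confirmed here.

```python
from pathlib import Path

# Extensions accepted by the load dialog, as stated above.
ALLOWED_EXTENSIONS = {".db", ".sql", ".sqlite"}


def companion_tree_path(dataset_path):
    """Return the path where a matching .tree file would be expected.

    Assumption: the .tree file shares the dataset's base name and directory.
    """
    path = Path(dataset_path)
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported dataset extension: {path.suffix!r}")
    return path.with_suffix(".tree")


tree_path = companion_tree_path("data/yeast.sqlite")
```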

The default display orders genome rows in the Primary Display by their position as specified in the .tree file if available and otherwise by their genome_id field (see "Data Format" for details). Rows can be rearranged by holding down the Alt key and clicking and dragging sequence rows vertically within the display. Crossing branches in the phylogenetic tree are then reduced in opacity. Within each row, neighboring genes are grouped into blocks, collections of genes aggregated spatially according to the Block Width parameter (see "Customizing the Display" for details). By default, each block is displayed as the average color of genes within the region represented by the block; however, alternative visual representations of the blocks are available.
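To make the default block color concrete, here is a minimal sketch of averaging gene colors within fixed-width bins. It is not the tool's actual algorithm: the binning by integer division of a position value, the RGB tuples, and the function name average_block_colors are all assumptions.

```python
from collections import defaultdict


def average_block_colors(genes, block_width):
    """Group genes into blocks and average their colors.

    genes: iterable of (position, (r, g, b)) pairs (assumed representation).
    Returns {block index: (mean r, mean g, mean b)}.
    """
    blocks = defaultdict(list)
    for position, color in genes:
        # Genes whose positions fall in the same bin share a block.
        blocks[position // block_width].append(color)
    return {
        index: tuple(sum(channel) / len(colors) for channel in zip(*colors))
        for index, colors in sorted(blocks.items())
    }


block_colors = average_block_colors(
    [(0, (255, 0, 0)), (5, (0, 0, 255)), (12, (0, 255, 0))],
    block_width=10,
)
```

With a block width of 10, the first two genes fall into block 0 and the third into block 1, so a wider block merges more genes into a single averaged glyph.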

Blocking

Genes within a particular spatial region are grouped into a single glyph, called a block. The width and representation of a block can be controlled by the user.

Mousing over a block in the Primary Display reveals a tooltip containing the names of the genes within the block, followed by the chromosome name and genome name on a separate line. If a block contains a large number of genes, the list of names may be truncated, as indicated by an ellipsis. Mousing over a block also highlights the following: any blocks in the Primary Display containing at least one gene orthologous to a gene in the current block; the locations of the component genes' ortholog groups in the Histogram display; the corresponding genes in the Zoom Window, if locked; and, in the tree, the active genome branch (green) and its immediate sibling branches (red). Clicking on a block in the Primary Display draws lines between all blocks containing at least one gene matching a gene in the clicked block. Links are only drawn between blocks sharing a matching gene. Clicking on the block again removes the lines. Right clicking on any block and selecting "Clear Links" removes all lines from the Primary Display.

Detailed information about the contents of a block can be accessed through the Primary Display. Right clicking on a block and selecting "Details" will open a pop-up containing a list of genomes in which genes orthologous to the component genes of the clicked block can be found. Selecting a genome name displays information about the orthologous gene in that sequence, including information from the description field of the database. Genes are indicated by '*' and blocks are separated by '***'. The blocks from the genome initially selected are preceded by "Target:" in the genome window.
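The same kind of information can be pulled directly from the database. The sketch below lists every gene in a given ortholog group along with its genome and chromosome. It only illustrates the schema from "Data Format" and is not how the tool itself builds the "Details" pop-up; the function name find_orthologs and the sample rows are assumptions, and the gene table is reduced to the columns the query needs.

```python
import sqlite3

# Join gene -> chromosome -> genome and keep genes in one ortholog group.
ORTHOLOG_QUERY = """
SELECT genome.name, chromosome.name, gene.name, gene.description
FROM gene
JOIN chromosome ON chromosome.chromosome_id = gene.chromosome_id
JOIN genome ON genome.genome_id = chromosome.genome_id
WHERE gene.ortholog_group_id = ?
ORDER BY genome.genome_id, gene.index_in_chromosome
"""


def find_orthologs(conn, ortholog_group_id):
    """Return (genome, chromosome, gene, description) rows for one group."""
    return conn.execute(ORTHOLOG_QUERY, (ortholog_group_id,)).fetchall()


# Tiny in-memory dataset, invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome (genome_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE chromosome (chromosome_id INTEGER PRIMARY KEY, name TEXT,
                         genome_id INTEGER, length INTEGER);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, description TEXT,
                   chromosome_id INTEGER, index_in_chromosome INTEGER,
                   ortholog_group_id INTEGER);
INSERT INTO genome VALUES (0, 'genome_A'), (1, 'genome_B');
INSERT INTO chromosome VALUES (0, 'chr1', 0, 2), (1, 'chr1', 1, 1);
INSERT INTO gene VALUES
    (0, 'a1', 'first gene', 0, 0, 7),
    (1, 'a2', 'second gene', 0, 1, 8),
    (2, 'b1', 'match for a1', 1, 0, 7);
""")

orthologs = find_orthologs(conn, 7)
```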

Genes contained within a block can also be explored in the Zoom Window. Mousing over a block in the Primary Display will cause the components of the block to be displayed in the Zoom Window. Genes are displayed on either side of a horizontal guide line according to their strand field, which represents whether the gene lies on the forward or backward strand (see "Data Format" for details). The genes are represented as colored rectangles: their color and positioning correspond to their color and position values as defined by the parameters of the Primary Display. To interact with the components of a block, right click on the block in the Primary Display and select "Zoom to Block". The block is then locked into the Zoom Window and outlined in the Primary Display. If the component genes in the Zoom Window are dense, they may be aggregated into blocks, in which case a zoomed block can be explored further by again locking the block in the Zoom Window.

Once locked, mousing over a gene will highlight blocks in the Primary Display containing at least one gene orthologous to the highlighted gene, the branches in the Dendrogram of genomes containing elements orthologous to the highlighted gene, up to the most recent common ancestor, and the location of the ortholog group in the Histogram distribution. Clicking on a block will draw a connecting link between blocks containing orthologous genes in the Primary Display. Clicking again will remove the link. Right click on any block in the Primary Display or Zoom Window and select "Unlock Zoomed Block" to release the block from the Zoom Window and return to mouseover zoom in the Primary Display.

Mousing over a bar in the Histogram display creates a tooltip listing the ortholog_group_id values of the ortholog groups represented by the bar. It also highlights, in the Dendrogram, the branches of genomes containing genes from those ortholog groups up to the most recent common ancestor, and the corresponding blocks containing those genes in the Primary Display. Clicking and dragging the mouse in the Histogram display draws a red rectangle in the Histogram. Releasing the mouse filters the data in the Primary Display: blocks containing genes corresponding to the bars within the rectangle remain fully opaque, while all other blocks are made partially transparent. Pressing escape removes this filter and clears the Histogram display.

 


Customizing the Display

The Navigation Pane contains a variety of menus for customizing the display. Selecting a tab in the Navigation Pane will open the corresponding menu set. This section will discuss these menus and how they can be used to support analysis in Sequence Surveyor. See the paper for details and examples of the types of analysis that these customizations can support.

Load Dataset

The Load Database Menu

The "Load Dataset" menu contains options for interacting with and changing the underlying data of the display.

Properties

The Properties menu

The "Properties" menu allows the user to manipulate the display parameters of the Primary Display. The "Update" button adjusts the parameters of the display to match those set in this menu.

Color Scale

The Color Scale menu

The "Color Scale" menu displays the currently active color mappings for the Primary Display.

The Color Scale menu displays the detailed information about the current color mapping properties of the display: the active ramp is displayed from minimum property value to maximum and the minimum and maximum values are displayed below the ramp. For split color encodings, two ramps are used, one for the first portion of the data (that defined by the "Color Scheme" ramp) and the other for the second (defined by the "Secondary Scheme" ramp).

Filter Menu

The Filter Menu

The "Filter Menu" allows the user to visually filter the data displayed in the Primary Display.

The Filter Menu allows the user to build conjunctive filters over the dataset. By entering data parameters in the text fields, a filter is constructed. Field data is applied as follows:

Multiple parameter values can be entered into each field by separating values using a comma. Clicking the "Filter" button will reduce the opacity of any blocks not satisfying at least one of the filter criteria in the Primary Display, and any gene not matching the criteria in the Zoom Window.

Clicking on "Remove Filter" will clear the current filter boxes and restore the opacity of all filtered blocks. The "Load Filter" button will open a load file dialog box. Opening a .filter file using this dialog will load the parameters of the filter into the "Filter Menu". The "Save Filter" button will open a save file dialog box to write out the filter parameters to a .filter file. All .filter files are of the format:

gene_names: <comma-separated list of gene names>
gene_orthologs: <comma-separated list of ortholog_group_ids>
gene_frequencies: <comma-separated list of frequencies (ortholog group sizes)>
gene_membership_frequencies: <comma-separated list of membership frequencies>
chromosomes: <comma-separated list of chromosome names>
genomes: <comma-separated list of genome names>
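A .filter file in the format above could be written or read outside the tool with a sketch like the following. The six keys and their order come from the format listing; the function names, the choice to write empty fields as a key with no values, and the tolerance for whitespace around values are assumptions, since those details are not specified here.

```python
# Keys in the order given by the format listing above.
FILTER_KEYS = [
    "gene_names",
    "gene_orthologs",
    "gene_frequencies",
    "gene_membership_frequencies",
    "chromosomes",
    "genomes",
]


def format_filter(params):
    """Render a {key: list of values} mapping as .filter text."""
    lines = []
    for key in FILTER_KEYS:
        values = params.get(key, [])
        lines.append(f"{key}: {','.join(str(value) for value in values)}")
    return "\n".join(lines) + "\n"


def parse_filter(text):
    """Parse .filter text into {key: list of strings}; unknown keys are skipped."""
    params = {key: [] for key in FILTER_KEYS}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, separator, raw_values = line.partition(":")
        key = key.strip()
        if separator and key in params:
            params[key] = [v.strip() for v in raw_values.split(",") if v.strip()]
    return params


filter_text = format_filter({"gene_names": ["geneA", "geneB"], "gene_orthologs": [3, 7]})
parsed = parse_filter(filter_text)
```

All parsed values are returned as strings; converting ortholog_group_id values or frequencies back to integers is left to the caller.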

 


 


 


 

Adobe AIR must be installed to run Sequence Surveyor and TextDNA.

Email dalbers@cs.wisc.edu for more information.