Design of LayerCake


Previous NGS Vis

Next Generation Sequencing (NGS) relies on aligning many different reads into a single sequence. Variation can then be identified by comparing the reads at a location to a reference (or consensus) sequence. This alignment is typically visualized with a “scaffold view,” in which all the reads are “stacked” on top of each other.


A “scaffold view” of NGS data: individual reads are placed on top of each other, and a consensus sequence (or sequence logos) is displayed as a summary at the bottom.

This technique does not scale, either with the depth of reads at a particular location or with the length of the sequence to be analyzed (for viral data, there can be tens of thousands of reads at a single location and tens of thousands of base pairs in a genome). Metadata (total coverage, read quality, alignment quality) is difficult or impossible to see from the scaffold itself, and must be recovered through annotation or explicit interaction. Lastly, the scaffold view does not scale with the number of populations to compare: even if there is visual space to display one “scaffold,” displaying a dozen is out of the question for meaningful, holistic analysis.

LayerCake


A layer, corresponding to reads taken from a single population of viruses. Each colored bar represents a region of 30 base pairs. The redder the bin, the more variation occurs in that region.


Clicking on a bin expands its contents using focus+context navigation. The rest of the layer is compressed to make room for the detailed view. At this level of detail, we can also display reference sequence(s).

Our solution to the problem is to aggregate the scaffold into a “layer”: a colored band showing variation as we move through the genome. To fit the entire layer on a single screen, individual base pairs are “binned” into discrete regions of tens of base pairs apiece. Viewers can control the size of these bins, and can click on a bin to view its contents in detail.
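To make the binning concrete, here is a minimal sketch in Python (our choice for illustration only; this is not necessarily LayerCake's implementation language). It assumes a hypothetical per-base array giving the fraction of variant reads at each position:

    import numpy as np

    def bin_variation(per_base_variation, bin_size=30):
        """Average per-base variation into fixed-width bins.
        per_base_variation: 1-D array of the fraction of variant reads at
        each base pair (a hypothetical input; LayerCake derives its values
        from the aligned reads). The last bin may be partial."""
        return np.array([per_base_variation[i:i + bin_size].mean()
                         for i in range(0, len(per_base_variation), bin_size)])

    # A 10,000 bp genome at 30 bp per bin yields ~334 glyphs across the screen.
    variation = np.random.rand(10_000) * 0.05
    print(len(bin_variation(variation)))  # 334

Averaging is only one possible aggregate; the trade-offs it involves are discussed under Aggregation below.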

This visual metaphor is desirable for holistic analysis of variation. We do not have the screen space or cognitive resources to meaningfully display all data simultaneously. In our view, the dataset is shown at a low level of resolution, but red regions of high importance remain visually salient, indicating regions that warrant detailed investigation.

Confidence Visualization

The color wedge for LayerCake: locations with high variation are redder, locations with low confidence are grayer.

Next generation sequencing has many sources of error. Reads can be misaligned or misread, giving the impression that variation exists when in fact it does not. Most sequencing pipelines generate metrics that are correlated with these errors, but there are no solid a priori metrics for determining with certainty whether significant variation exists at a particular location. We therefore allow the viewer to interactively set their own heuristics for evaluating confidence in the data.
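As a purely hypothetical example of such a viewer-set heuristic, one might score each location by its weakest quality metric; the metric names, cutoffs, and the min-based combination below are illustrative assumptions, not a formula from LayerCake:

    def confidence(coverage, mean_mapping_quality, min_coverage=20, max_mapq=60):
        """Toy confidence heuristic (hypothetical, not LayerCake's formula):
        scale each metric to [0, 1] and score the location by its weakest
        metric, so low coverage or poor alignment alone can sink confidence."""
        cov_score = min(coverage / min_coverage, 1.0)
        mapq_score = min(mean_mapping_quality / max_mapq, 1.0)
        return min(cov_score, mapq_score)

    print(confidence(coverage=5, mean_mapping_quality=55))   # 0.25 -> low trust
    print(confidence(coverage=40, mean_mapping_quality=60))  # 1.0  -> full trust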

Considering data confidence has the unfortunate side effect of adding an extra dimension to our data: a location has both an amount of variation and a certainty value. Encoding both values with high fidelity using color is difficult, since color maps that show two values at once (bivariate color ramps) make sacrifices among intuitive ordering of color, fidelity of encoding, and ease of discrimination. Our solution is to assume that viewers do not need to discriminate values when data quality is low: instead of a full two-dimensional color ramp (conceptually a square), we use a ramp in which the number of discrete, discriminable colors decreases along with confidence, until, below a certain level of certainty, all values collapse to the same grey (conceptually an arc, or wedge). This has the visual effect of a “confidence fog”: values recede into the background as confidence decreases. We allow the viewer to control the speed at which this fogging occurs.
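A minimal sketch of the wedge, assuming variation and confidence both lie in [0, 1]; the grey cutoff, the number of quantization steps, and the white-to-red ramp are illustrative choices, not LayerCake's published parameters:

    def wedge_color(variation, confidence, grey_cutoff=0.2, max_steps=8):
        """Map (variation, confidence) to an RGB triple in [0, 1]^3.
        As confidence drops, the number of discriminable variation steps
        shrinks; below grey_cutoff every value collapses to one grey.
        All constants here are illustrative assumptions."""
        grey = (0.6, 0.6, 0.6)
        if confidence < grey_cutoff:
            return grey                                    # uniform "confidence fog"
        steps = max(2, round(confidence * max_steps))
        v = round(variation * (steps - 1)) / (steps - 1)   # quantize variation
        red = (1.0, 1.0 - v, 1.0 - v)                      # white (v=0) to red (v=1)
        # Fade the quantized color toward grey as confidence decreases.
        return tuple(confidence * r + (1.0 - confidence) * g
                     for r, g in zip(red, grey))

    print(wedge_color(variation=0.9, confidence=1.0))  # saturated red
    print(wedge_color(variation=0.9, confidence=0.1))  # fogged-out grey

The key property is that the set of distinguishable outputs narrows as confidence falls, so low-confidence regions recede rather than mislead.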

Aggregation

Above, a bin appears to have no variation because most locations in the bin have no variant reads. Event striping shows that there are two locations within the bin that do have high variation; they are simply lost when we average.

We consider it a high priority to display a view of the entire dataset on a single screen. Given the scale of the datasets for this problem, this means that some data must necessarily be lost. Even if we had enough pixels to show every single location simultaneously, it is not clear that the viewer could meaningfully analyze such a display. Binning is one solution to this problem (each glyph encodes a discrete region of the genome), but it leads to another choice: how to aggregate information at the level of individual base pairs into a glyph representing many base pairs.

Our decision is to average, on the assumption that interesting patterns of variation occur across multiple base pairs. In practice this assumption can be violated: either particular individual locations are of great importance, or there are so many base pairs in a bin that the “interesting” variation is drowned out in the average by the “uninteresting” locations. To circumvent this, we offer the optional technique of “event striping”: if the viewer cares about locations of particularly high variation, a red stripe is drawn wherever a location meets this threshold of interest. Some of the visual space of a bin is sacrificed to show that an interesting event is occurring that might otherwise be lost in the averaging.
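Building on the earlier binning sketch, the following hypothetical function pairs each bin's average with a stripe flag; the 0.5 threshold stands in for the viewer's chosen threshold of interest:

    import numpy as np

    def bin_with_stripes(per_base_variation, bin_size=30, stripe_threshold=0.5):
        """Return (average, striped) per bin: 'striped' flags any single
        location at or above the viewer's threshold, so a red event stripe
        can be drawn even when the bin's average looks uninteresting."""
        out = []
        for i in range(0, len(per_base_variation), bin_size):
            chunk = per_base_variation[i:i + bin_size]
            out.append((chunk.mean(), bool((chunk >= stripe_threshold).any())))
        return out

    # Two high-variation sites hidden in an otherwise quiet 30 bp bin:
    quiet_bin = np.full(30, 0.01)
    quiet_bin[[7, 21]] = 0.9
    print(bin_with_stripes(quiet_bin))  # [(~0.069, True)]: low average, but striped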
