To add a new dataset, drag the correct files into the “Virus Data” folder. To be a valid dataset for LayerCake, a folder needs:
- At least one CSV (comma-separated values) file of SNP reports that meets the format standards described below. A CSV may contain multiple populations. If the folder contains multiple CSVs, LayerCake will parse every population in every valid CSV and display them all in the same window. If a CSV contains only one population and has no name column, the filename is used as the population name in LayerCake.
The remaining files considered by LayerCake are optional, but recommended:
- Information about the reference sequences in fasta format. If no reference information is provided, LayerCake will produce a reference based on the per-population consensus. However, since many programs for generating SNPs do not report on regions with no variation, this consensus sequence will likely be missing data at many locations, which can make inter-population comparison difficult. Multiple sequences can be in a single fasta file, or multiple fasta files can be used. The names of the sequences need to match the population names used in the CSV.
- Annotation information in gff, gtf, or gbk format containing information about open reading frames. This is used to calculate synonymy. Some gbk files contain the origin sequence, which will be used as the default reference sequence when provided. If no such sequence is provided, then LayerCake will calculate the “population consensus sequence” by having each population vote on the reference base pair for each location.
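The voting fallback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LayerCake's actual implementation: the function name `consensus_reference`, the input layout (one dictionary of position to reported reference base per population), the `N` placeholder for positions with no data, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def consensus_reference(population_calls, length, missing="N"):
    """Build a consensus reference by majority vote across populations.

    population_calls: dict mapping population name -> {position: base},
        where position is 1-based and base is the reference base that
        population reports at that position (hypothetical layout).
    length: number of positions in the consensus sequence.
    missing: placeholder for positions that no population reports on.
    """
    consensus = []
    for pos in range(1, length + 1):
        # Each population that has data at this position casts one vote.
        votes = Counter(
            calls[pos] for calls in population_calls.values() if pos in calls
        )
        if votes:
            # Ties go to the base seen first; this rule is an assumption.
            consensus.append(votes.most_common(1)[0][0])
        else:
            consensus.append(missing)
    return "".join(consensus)


calls = {
    "popA": {1: "A", 2: "C"},
    "popB": {1: "A", 2: "G", 4: "T"},
    "popC": {2: "G"},
}
print(consensus_reference(calls, 4))  # AGNT
```

Position 3 comes out as the placeholder because no population reports on it, which is the kind of missing data that supplying a fasta reference avoids.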
LayerCake Data Format
To be a valid CSV for LayerCake, the data must include a minimum set of columns. Different programs and analysis tools generate different column sets or use different names for similar columns, so LayerCake accepts several names for each required column. Additional columns will not affect LayerCake's ability to load the data.
|Accepted column name|Description|
|---|---|
|“Nucleotide” or “Minimum” or “Min (with gaps)”|The location of a particular variant with respect to the reference sequence.|
|“Change” or “Variant Nucleotide(s)”|The change from reference to variant (e.g. “A -> G”) or the variant nucleotide, respectively.|
|“Variant P-Value” or “Variant P-Value (approximate)” or “Strand-Bias >50% P-value”|A p-value associated with uncertainty of variation at this location.|
|“Coverage”|The number of reads at this location.|
An optional column is “Sequence Name”, which is used when a CSV contains multiple populations. Without this column, LayerCake will assume that all variants in the file are from a single population, and name this population after the file name. The sequence name must match the reference name in the fasta file for the reference information to be properly linked.
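Before dragging a CSV into the folder, it can help to check that its header contains one accepted name for each required column. The following is a minimal sketch of such a check, not part of LayerCake itself: the names `REQUIRED_COLUMNS`, `resolve_columns`, and `check_csv` are hypothetical, and the exact, case-sensitive matching is an assumption, since the matching rules LayerCake applies are not described here.

```python
import csv
import io

# Accepted header names for each required column, as listed above.
# The dictionary keys are hypothetical labels used only in this sketch.
REQUIRED_COLUMNS = {
    "position": ["Nucleotide", "Minimum", "Min (with gaps)"],
    "change": ["Change", "Variant Nucleotide(s)"],
    "p_value": [
        "Variant P-Value",
        "Variant P-Value (approximate)",
        "Strand-Bias >50% P-value",
    ],
    "coverage": ["Coverage"],
}


def resolve_columns(header):
    """Return (found, missing) for a list of header names.

    found maps each required field to the header name that satisfied it;
    missing lists the required fields with no accepted name in the header.
    """
    names = [h.strip() for h in header]
    found, missing = {}, []
    for field, aliases in REQUIRED_COLUMNS.items():
        match = next((a for a in aliases if a in names), None)
        if match is None:
            missing.append(field)
        else:
            found[field] = match
    return found, missing


def check_csv(text):
    """Check the header row of CSV text against the required columns."""
    header = next(csv.reader(io.StringIO(text)), [])
    return resolve_columns(header)


sample = (
    "Sequence Name,Minimum,Change,Variant P-Value,Coverage\n"
    "popA,101,A -> G,0.001,250\n"
)
found, missing = check_csv(sample)
print(found["position"], missing)  # Minimum []
```

Extra columns such as “Sequence Name” are ignored by the check, matching the note that additional columns do not prevent loading.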
Converting SAM Files
Included with LayerCake is an application called SAMParser. This program generates a LayerCake-compatible folder from an initial folder of one or more SAM files (BAM files must be converted first). To use SAMParser, select a folder containing one or more SAM files. Parsing the files may take some time (on the order of minutes per file). If successful, the resulting folder will contain a single CSV file of all variant information for each population provided by the SAM files (a SAM file may contain multiple sequences, so you might see more rows in LayerCake than there were initial files). Simply drag this new CSV into the LayerCake Virus Data folder to be able to load this data into LayerCake on startup.
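SAMParser does not say which tool to use for the BAM conversion. One common option, assumed here and not something LayerCake requires, is the `samtools view` command with `-h` to keep the header and `-o` to name the output file. The sketch below only builds that command for each BAM file; the helper name `bam_to_sam_command` and the example paths are hypothetical.

```python
from pathlib import Path


def bam_to_sam_command(bam_path, out_dir):
    """Build a samtools command that writes a SAM copy of one BAM file.

    The output keeps the input's base name with a .sam extension and is
    placed in out_dir, so the folder can then be selected in SAMParser.
    """
    bam = Path(bam_path)
    sam = Path(out_dir) / (bam.stem + ".sam")
    # -h includes the header in the output; -o sets the output file.
    return ["samtools", "view", "-h", "-o", str(sam), str(bam)]


cmd = bam_to_sam_command("reads/popA.bam", "sam_input")
print(" ".join(cmd))
```

With samtools installed, each command can be run with `subprocess.run(cmd, check=True)` after creating the output folder.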