Breseq Results

Reading the results files

Look at the breseq manual for information on output formats. Click around until you're familiar with what everything means.

First, neaten up any problem mutations in the genome where breseq may not be able to figure out exactly what is happening.

Currently, these are usually:

  1. Gene conversions between copies of near-repeat sequences such as rhs elements and ribosomal RNAs.
    These will look like small deletions in the middle of repeat elements. They may be difficult to reliably recover, so it is best to check for them by making coverage graphs of all genomes and manually examining them.
  2. Short duplications (<20 nt)
  3. Deletions that result from recombination between two new IS element insertions (neither of which existed in the reference genome).
  4. Deletions or duplications at the ends of IS elements

Store all manual fixes for these mutations in a GenomeDiff-formatted text file. Analysis pipelines can use this file to make comparison charts, count the different kinds of mutations, and construct phylogenetic trees. The breseq manual describes how to encode new lines in this format.
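As an illustration of the kind of counting a pipeline might do on a GenomeDiff file, here is a minimal Python sketch. It assumes the file is tab-delimited, that comment and header lines start with "#", and that the first column of each entry is a type code; the set of mutation codes below is my reading of the breseq manual and should be checked against it. The function name and the example entries are made up for illustration and do not come from any real genome.

```python
from collections import Counter

# Type codes assumed to denote mutation entries (evidence lines such as
# RA, MC, JC, and UN are ignored). Verify this list against the breseq manual.
MUTATION_TYPES = {"SNP", "SUB", "DEL", "INS", "MOB", "AMP", "CON", "INV"}


def count_mutation_types(gd_text):
    """Count mutation entries by type code in the text of a GenomeDiff file."""
    counts = Counter()
    for line in gd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, the header, and metadata
        entry_type = line.split("\t")[0]
        if entry_type in MUTATION_TYPES:
            counts[entry_type] += 1
    return counts


# Hypothetical entries with placeholder IDs, positions, and names.
EXAMPLE_GD = "\n".join([
    "#=GENOME_DIFF\t1.0",
    "SNP\t1\t5\tREL606\t12345\tA",
    "DEL\t2\t6\tREL606\t20000\t15",
    "MOB\t3\t7,8\tREL606\t30000\tIS150\t1\t3",
    "RA\t5\t.\tREL606\t12345\t0\tG\tA",
])

if __name__ == "__main__":
    for entry_type, n in sorted(count_mutation_types(EXAMPLE_GD).items()):
        print(entry_type, n)
```

Running the same function over the curated file for each genome gives a quick check that manual edits did not drop or duplicate entries.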

We also need to split mutations that have been overwritten by later mutations. This most commonly happens when a new IS element inserts and small insertions or deletions later occur at its boundary. When there is evidence that these were two separate mutational events, change what may be a single line in the GenomeDiff file into two separate entries.

Check for duplicated regions by manually examining tiling coverage charts. Command to generate files:
batch_run.pl "mkdir output/tiled_coverage; bam2cov --bam=data/reference.bam --fasta=data/reference.fasta --tile=40000 --tile-overlap=5000 --resolution=1000 --output=output/tiled_coverage"
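Manual inspection of the charts remains the main check. As a rough first-pass screen, a sketch like the following could flag windows whose coverage is well above the genome-wide median. It assumes you already have a list of mean coverage values per window (how you extract these is not covered here), and the function name and the 1.8-fold threshold are arbitrary choices for illustration, not breseq settings.

```python
from statistics import median


def flag_high_coverage(window_coverage, fold=1.8):
    """Return indices of windows with mean coverage >= fold * median coverage."""
    baseline = median(window_coverage)
    return [i for i, cov in enumerate(window_coverage) if cov >= fold * baseline]


if __name__ == "__main__":
    # Made-up coverage values; windows 3 and 4 are roughly twice the median.
    print(flag_high_coverage([50, 52, 49, 101, 98, 51, 50]))
```

Flagged windows are only candidates. Repeat sequences and coverage bias near the replication origin can also raise coverage, so confirm each one on the tiled coverage chart.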

Validating Mutations

Add notes on primer design for various kinds of mutations here... Reference and improve the other protocol sections.

Clone Time Series Data

Next, we want to order the mutations in clone samples over time to figure out which happen early and which happen late. Mutations in any one clone genome can be "on" or "off" the main line of descent. You can tell which are on the line of descent by checking whether the same mutations are also present in later genomes. Ideally, we can represent this as a tree on which we label the branch where each mutation occurred.
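The on/off check can be sketched in Python as follows. This is a simplified illustration: it assumes each clone is given as a sampling time plus a set of mutation labels, and it calls a mutation "on" the line of descent only if every later clone carries it. That rule is strict, because a later clone may itself sit on a side branch, so treat the output as a starting point for building the tree by hand. The function name and the mutation labels are hypothetical.

```python
def classify_line_of_descent(clones):
    """Classify each clone's mutations as on or off the main line of descent.

    clones: list of (time, set_of_mutation_labels), in any order.
    Returns {time: {"on": set, "off": set}} for every clone except the last,
    which has no later genome to compare against.
    """
    ordered = sorted(clones, key=lambda clone: clone[0])
    result = {}
    for i, (time, mutations) in enumerate(ordered[:-1]):
        later_sets = [later for _, later in ordered[i + 1:]]
        # Assumption: "on" means present in all later clones.
        on = {m for m in mutations if all(m in later for later in later_sets)}
        result[time] = {"on": on, "off": mutations - on}
    return result


if __name__ == "__main__":
    example = [
        (2000, {"A", "B"}),
        (5000, {"A", "C"}),
        (10000, {"A", "C", "D"}),
    ]
    for time, groups in classify_line_of_descent(example).items():
        print(time, "on:", sorted(groups["on"]), "off:", sorted(groups["off"]))
```

In this made-up example, mutation B appears only in the earliest clone and is classified as off the line of descent, while A and C persist into later clones.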

Mixed Population Time Series Data

Mixed population data lets us say something about the frequency of different sub-populations with different mutations in an evolving population. The mixed population data is another opportunity to gain information about which mutations on the main line of descent happened first. If we catch some mutations at intermediate frequencies when others are at 100%, then we know which of those happened first. We also want to reconstruct what mutations were likely to be in the same individuals in the population when it has diverged into multiple types.
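The frequency argument can be written as a small Python sketch. It assumes you have one sample's mutation frequencies as a dictionary, that all mutations considered are on the main line of descent, and that a frequency of 1.0 means fixed. The function name, the threshold parameter, and the labels are hypothetical, and real data would need a tolerance for measurement noise near 100%.

```python
def infer_order(frequencies, fixed=1.0):
    """Return (earlier, later) pairs implied by one mixed-population sample.

    frequencies: {mutation_label: frequency between 0 and 1}.
    A mutation at or above `fixed` is inferred to precede any mutation
    found at an intermediate frequency in the same sample.
    """
    fixed_muts = sorted(m for m, f in frequencies.items() if f >= fixed)
    intermediate = sorted(m for m, f in frequencies.items() if 0 < f < fixed)
    return [(early, late) for early in fixed_muts for late in intermediate]


if __name__ == "__main__":
    # Made-up frequencies: A is fixed, B is at 40%, C is not detected.
    print(infer_order({"A": 1.0, "B": 0.4, "C": 0.0}))
```

Combining the pairs from several time points gives a partial order that can be compared against the ordering inferred from the clone genomes.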


Topic revision: r2 - 26 Apr 2011 - 17:32:07 - Main.JeffreyBarrick