Tutorial: Ultra-rare variant detection using consensus reads and targeted sequencing¶
In this exercise, you will analyze targeted sequencing data that capture the initial burst of genetic diversity in a short E. coli evolution experiment. The data were prepared with a special library preparation technique that adds a “molecular index” to each initial DNA fragment. This makes it possible to sequence many amplification products of the same initial fragment and combine them into a consensus, which achieves lower error rates. In addition, pulldowns with biotinylated oligos were used to enrich for certain genes in the E. coli genome, giving deeper sequencing of regions that were expected to have beneficial mutations.
Note
This tutorial was created for the EMBO Practical Course Measuring intra-species diversity using high-throughput sequencing held 27–31 July 2015 in Oeiras, Portugal.
1. Download data files¶
First, create a directory called tutorial_sscs_targeted and change into it:
$ mkdir tutorial_sscs_targeted
$ cd tutorial_sscs_targeted
Reference sequence¶
breseq prefers the reference sequence in GenBank or GFF3 format. In this example, the reference sequence is Escherichia coli B strain REL606. The GenBank (RefSeq) accession number is NC_012967. You can search for this sequence at http://www.ncbi.nlm.nih.gov/ or follow this direct link.
This reference sequence was created by using gdtools MASK.
Read files¶
Download from.... FILL IN DETAILS
2. Generate SSCS Reads¶
First, we need to pre-process the reads to construct single-strand consensus reads.
$ SSCS_DCS.py -f1 DED110_CATGGC_L006_R1_001.fastq -f2 DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log
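To make the idea behind single-strand consensus reads concrete, here is a minimal Python sketch. It is not the SSCS_DCS.py implementation. It assumes that each read begins with its 12-base molecular index; the function names, the `min_fraction` agreement threshold, and the `min_family_size` cutoff are hypothetical choices made for illustration only.

```python
from collections import Counter, defaultdict

INDEX_LENGTH = 12  # molecular index length stated in this tutorial


def group_by_index(reads, index_length=INDEX_LENGTH):
    """Group read sequences into families that share a leading molecular index.

    Assumption: the index occupies the first `index_length` bases of each read.
    """
    families = defaultdict(list)
    for seq in reads:
        families[seq[:index_length]].append(seq)
    return families


def consensus(family, min_fraction=0.6):
    """Call the most common base at each position, or N if agreement is too low."""
    length = min(len(seq) for seq in family)
    called = []
    for i in range(length):
        base, count = Counter(seq[i] for seq in family).most_common(1)[0]
        called.append(base if count / len(family) >= min_fraction else "N")
    return "".join(called)


def build_sscs(reads, min_family_size=2):
    """Return one consensus read per molecular index that has enough members."""
    return {
        index: consensus(family)
        for index, family in group_by_index(reads).items()
        if len(family) >= min_family_size
    }
```

An error that appears in only one amplification product of a fragment is outvoted by the other reads in its family, which is why the consensus has a lower error rate than any single read.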
Next, trim 16 bases off each read: these are the 12 bases of the molecular index plus a 4-base constant region.
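The trimming step could be sketched in Python as follows. This is an illustration under stated assumptions, not the command used in the course: it assumes the 16 bases sit at the start of each read and that the FASTQ file uses plain 4-line records. The function names and file paths are hypothetical.

```python
TRIM_LENGTH = 16  # 12-base molecular index + 4-base constant region


def trim_fastq(lines, trim_length=TRIM_LENGTH):
    """Yield FASTQ lines with `trim_length` bases removed from the start of each read.

    Assumptions: 4-line FASTQ records, and the bases to remove are at the 5' end.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Lines 2 and 4 of each record hold the bases and their quality scores,
        # so both must be trimmed to keep them the same length.
        if i % 4 in (1, 3):
            line = line[trim_length:]
        yield line


def trim_fastq_file(in_path, out_path, trim_length=TRIM_LENGTH):
    """Trim every record of the FASTQ file at `in_path` and write it to `out_path`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in trim_fastq(src, trim_length):
            dst.write(line + "\n")
```

Trimming the quality line together with the sequence line matters: a FASTQ record whose two lines differ in length is invalid and is typically rejected by downstream tools.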
3. Initial breseq runs to identify candidate variants¶
DETAILS NEEDED.
We still pass the rest of the genome to breseq so that reads originating outside the target regions are not mismapped to them. This also lets us find new transposon insertions whose junctions join these genes to transposons present elsewhere in the genome.
ENTER COMMANDS
4. Second breseq runs to systematically tabulate frequencies¶
Sometimes a variant is not called in a given sample even though a few reads support it; those reads can be counted if we know to look for them. We can tell breseq to tabulate specific junctions and point mutations by passing it a GenomeDiff file of these targets.
First, create a file describing all of the variants that we have identified.
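One way to picture this step is as taking the union of the entries found across all samples. The Python sketch below is only an illustration: it assumes tab-separated GenomeDiff lines in which the first three columns are the entry type, an id, and parent ids, followed by positional fields and optional `key=value` fields. The function names are hypothetical, and the sketch does not address which entry types breseq expects in its input.

```python
GD_HEADER = "#=GENOME_DIFF 1.0"


def entry_key(fields):
    """Identify an entry by its type and positional fields.

    Ids, parent ids, and optional key=value fields are ignored, so the same
    variant found in two samples compares as equal.
    """
    positional = [f for f in fields[3:] if "=" not in f]
    return (fields[0], *positional)


def merge_genomediffs(texts):
    """Combine entries from several GenomeDiff texts without duplicates."""
    seen = set()
    merged = [GD_HEADER]
    for text in texts:
        for line in text.splitlines():
            # Skip blank lines and header/comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            key = entry_key(line.split("\t"))
            if key in seen:
                continue
            seen.add(key)
            entry_type, *positional = key
            # Renumber ids sequentially; parent ids are not carried over.
            merged.append("\t".join([entry_type, str(len(seen)), ".", *positional]))
    return "\n".join(merged) + "\n"
```

Sample-specific values such as observed frequencies are dropped on purpose here, since the merged file is meant to list what to look for rather than what was measured in any one sample.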
Now re-run breseq:
ENTER COMMANDS
5. Examine time-course data¶
Graph some of these trajectories in R.
Rule out false positives using an autocorrelation filter.
Fit selection coefficients.
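The last two steps might be sketched as below. The tutorial does not specify either method, so both functions are assumptions: the filter is read here as the lag-1 autocorrelation of a frequency trajectory, and the selection coefficient is estimated as the least-squares slope of ln(f / (1 - f)) against time in generations, which holds for a mutant with constant relative fitness competing against the rest of the population. All names are hypothetical.

```python
import math


def lag1_autocorrelation(values):
    """Lag-1 autocorrelation of a series; 0.0 if the series is constant."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom


def fit_selection_coefficient(generations, frequencies):
    """Estimate s (per generation) as the slope of logit(frequency) versus time.

    Time points with a frequency of exactly 0 or 1 are skipped because the
    logit is undefined there.
    """
    points = [
        (t, math.log(f / (1 - f)))
        for t, f in zip(generations, frequencies)
        if 0 < f < 1
    ]
    if len(points) < 2:
        raise ValueError("need at least two time points with 0 < frequency < 1")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    return sxy / sxx
```

Under this reading, a real variant that rises steadily gives a positive autocorrelation, while independent sequencing noise at each time point tends to give a value near or below zero. Any cutoff applied to that value would need to be calibrated on the data.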