Tutorial: Ultra-rare variant detection using consensus reads and targeted sequencing

In this exercise, you will analyze targeted sequencing data that capture the initial burst of genetic diversity in a short E. coli evolution experiment. The data were prepared with a special library preparation technique that adds a “molecular index” to each initial DNA fragment. Sequencing many amplification products of the same initial fragment and combining them into a consensus achieves a lower error rate. In addition, pulldowns with biotinylated oligos were used to enrich for certain genes in the E. coli genome, giving deeper sequencing of regions that were expected to have beneficial mutations.

Note

This tutorial was created for the EMBO Practical Course Measuring intra-species diversity using high-throughput sequencing, held 27–31 July 2015 in Oeiras, Portugal.

1. Download data files

First, create a directory called tutorial_sscs_targeted and change into it:

$ mkdir tutorial_sscs_targeted
$ cd tutorial_sscs_targeted

Reference sequence

breseq prefers the reference sequence in GenBank or GFF3 format. In this example, the reference sequence is Escherichia coli B strain REL606. The GenBank (RefSeq) accession number is NC_012967. You can search for this sequence at http://www.ncbi.nlm.nih.gov/ or follow this direct link.

This reference sequence was created by using gdtools MASK.

Read files

Download from.... FILL IN DETAILS

2. Generate SSCS Reads

First, we need to pre-process the reads to construct single-strand consensus reads.

$ SSCS_DCS.py -f1 DED110_CATGGC_L006_R1_001.fastq -f2 DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log
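To make the idea of a single-strand consensus concrete, here is a minimal Python sketch of the general technique. It is not the algorithm implemented in SSCS_DCS.py. It assumes that the molecular index is the first 12 bases of each read, and the function name, the min_family_size parameter, and the strict-majority rule are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

INDEX_LENGTH = 12  # molecular index length stated in this tutorial


def single_strand_consensus(reads, min_family_size=2):
    """Group reads by molecular index and build one majority-vote consensus per family.

    Assumption: the index occupies the first INDEX_LENGTH bases of every read.
    Returns a dict mapping each index to the consensus of the remaining bases.
    """
    families = defaultdict(list)
    for read in reads:
        families[read[:INDEX_LENGTH]].append(read[INDEX_LENGTH:])

    consensus = {}
    for index, members in families.items():
        if len(members) < min_family_size:
            continue  # too few copies to out-vote a sequencing error
        length = min(len(m) for m in members)
        bases = []
        for i in range(length):
            base, count = Counter(m[i] for m in members).most_common(1)[0]
            # Call N unless a strict majority of the family agrees on the base
            bases.append(base if count * 2 > len(members) else "N")
        consensus[index] = "".join(bases)
    return consensus
```

An error that appears in only one amplification product is out-voted by the other members of its family, which is why consensus reads have a lower error rate than raw reads.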

Now we need to trim 16 bases off each read, because these represent the 12 bases of the molecular index and a 4-base constant region.
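The tutorial does not yet give a command for this step. As a stand-in, the following Python sketch trims a FASTQ file. It assumes that the 16 bases sit at the start of each read and that every record spans exactly four lines; the function name and the file names in the usage note are hypothetical.

```python
TRIM_LENGTH = 16  # 12-base molecular index + 4-base constant region


def trim_fastq(in_path, out_path, trim=TRIM_LENGTH):
    """Remove the first `trim` bases and quality scores from every FASTQ record.

    Assumptions: four-line records, and the bases to remove are at the read start.
    Reads that would be empty after trimming are dropped.
    Returns the number of records written.
    """
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            header = src.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = src.readline().rstrip("\n")
            plus = src.readline().rstrip("\n")
            qual = src.readline().rstrip("\n")
            if len(seq) <= trim:
                continue  # nothing would remain after trimming
            dst.write(f"{header}\n{seq[trim:]}\n{plus}\n{qual[trim:]}\n")
            kept += 1
    return kept
```

For example, trim_fastq("consensus.fastq", "consensus.trimmed.fastq") would write the trimmed reads to a new file; both names are placeholders, since the output file names of the previous step are not specified here.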

3. Initial breseq runs to identify candidate variants

DETAILS NEEDED.

We still pass the rest of the genome to breseq so that reads originating outside the target regions are not mismapped to them. This also lets us find new transposon insertions, which create junctions between the targeted genes and transposons present elsewhere in the genome.

ENTER COMMANDS

4. Second breseq runs to systematically tabulate frequencies

Sometimes a variant is not detected in a given sample, even though we could find some reads supporting it if we knew to look for it. We can tell breseq to tabulate certain junctions and point mutations by passing it a GenomeDiff file of these targets.

First, create a file describing all of the variants that we have identified.
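One way to build such a file is sketched below in Python. It assumes that base substitutions are listed as RA (read alignment) evidence lines with the tab-separated fields type, id, parent ids, seq_id, position, insert_position, ref_base, and new_base; check this layout against the GenomeDiff format documentation for your breseq version. The function name is hypothetical, and junction (JC) targets are not covered.

```python
def write_user_evidence_gd(path, variants):
    """Write a GenomeDiff file with one RA evidence line per point mutation.

    `variants` is an iterable of (seq_id, position, ref_base, new_base) tuples.
    Assumed RA layout: type, id, parent ids, seq_id, position,
    insert_position, ref_base, new_base (tab-separated).
    """
    with open(path, "w") as gd:
        gd.write("#=GENOME_DIFF\t1.0\n")  # header line identifying the format
        for item_id, (seq_id, position, ref_base, new_base) in enumerate(variants, start=1):
            fields = [
                "RA",           # evidence type
                str(item_id),   # unique id within the file
                ".",            # evidence items have no parent ids
                seq_id,
                str(position),
                "0",            # insert_position 0: a change at an existing base
                ref_base,
                new_base,
            ]
            gd.write("\t".join(fields) + "\n")
```

The variant list itself comes from the candidates found in the initial runs, for example write_user_evidence_gd("targets.gd", candidates), where candidates and the file name are placeholders.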

Now re-run breseq:

ENTER COMMANDS

5. Examine time-course data

Graph some of these trajectories in R.

Rule out false positives using an autocorrelation filter.
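The filter is not specified further here, so the Python sketch below shows only one possible interpretation. It assumes the reasoning that the frequency of a real variant changes smoothly between consecutive time points, whereas recurrent sequencing errors fluctuate independently from sample to sample. The function names and the threshold of 0 are hypothetical.

```python
def lag1_autocorrelation(freqs):
    """Return the lag-1 autocorrelation of a frequency trajectory.

    Values near +1 mean consecutive time points move together;
    values at or below 0 are what independent noise tends to produce.
    """
    n = len(freqs)
    mean = sum(freqs) / n
    denom = sum((f - mean) ** 2 for f in freqs)
    if denom == 0:
        return 0.0  # a constant trajectory carries no signal
    num = sum((freqs[i] - mean) * (freqs[i + 1] - mean) for i in range(n - 1))
    return num / denom


def passes_autocorrelation_filter(freqs, threshold=0.0):
    """Keep a variant only if its trajectory is positively autocorrelated."""
    return lag1_autocorrelation(freqs) > threshold
```

A steadily rising trajectory passes this filter, while one that alternates between a low value and zero does not. A suitable threshold depends on the number of time points and would need to be calibrated on the data.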

Fit selection coefficients.
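A common way to estimate a selection coefficient, shown here as a Python sketch rather than as the method intended for this tutorial, assumes that a variant with a constant advantage s over the rest of the population follows ln(f / (1 - f)) = ln(f0 / (1 - f0)) + s * t, where f is its frequency and t is time in generations. The slope of a least-squares line through the logit-transformed frequencies then estimates s per generation. The function name is hypothetical.

```python
import math


def fit_selection_coefficient(generations, freqs):
    """Estimate s as the least-squares slope of logit(frequency) against time.

    Time points with a frequency of exactly 0 or 1 are skipped because
    the logit is undefined there. Returns (slope, intercept).
    """
    points = [
        (t, math.log(f / (1 - f)))
        for t, f in zip(generations, freqs)
        if 0 < f < 1
    ]
    if len(points) < 2:
        raise ValueError("need at least two time points with 0 < frequency < 1")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t
```

When many beneficial mutations rise at once, the advantage of each one relative to the population average shrinks over time, so estimates from this simple model are most reliable while all variants are still rare.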