Analyzing RNA-Seq data for differential gene expression

Materials and Software

Data files
- RNA-Seq data in FASTQ format (Ex: dataset1.fastq, dataset2.fastq)
Genome sequence files
- Genome sequence in FASTA format (Ex: REL606.fna)
- Genome sequence gene annotations in GFF3 format (Ex: REL606.gff3)
Adaptor filtering software *FAR Manual | Download
Read mapper software
- Bowtie2 (first choice)
  - Download Bowtie
  - To build, just type "make" in the source code directory.
  - Add this directory to your $PATH or move bowtie, and bowtie-* star executable to your $PATH
- BWA (alternate choice)
  - Download BWA
  - To build, just type "make" in the source code directory.
  - To install, move the executable "bwa" to somewhere in your $PATH, like $HOME/local/bin.
  - For usage see the BWA manual.
breseq
R statistics package
- Download and install R
Bioconductor R modules
- library(edgeR)
- library(DESEQ)

Commands

Remove adaptor sequences from reads

You will need a FASTA file of adaptor sequences.

Illumina Adapters Sequences (GSAF) | Download FASTA

For each input file you will need to run this command (single-end data):
$far --source datasetX.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger

There is an option to process paired-end data like this:
$far --source datasetX_R1.fastq --source2 datasetX_R2.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger

Compile and install FAR on MacOSX

Unfortunately, FAR comes only with Windows and Linux binaries. To build FAR(2.0) for MacOSX:

Install Apple Developer Tools
Install cmake: $sudo port install cmake
Check out code:
$ svn co https://theflexibleadap.svn.sourceforge.net/svnroot/theflexibleadap theflexibleada
Compile code:
$ cd theflexibleada
$ cmake CMakeLists.txt
Copy executable and library:
$ cp lib/libtbb.dylib ~/local/lib
$ cp build/* ~/local/bin
Add these locations to your path with lines in ~/.profile:
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$HOME/local/lib" export PATH="$PATH:$HOME/local/bin"

Align reads to reference genome

Using bowtie2

First, index your genome so bowtie2 can map read to it:
$bowtie2-build REL606.fna REL606

Then, align each data set:
$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam

Optionally, add the --local flag if your reads do not map end-to-end.

Using BWA

First, index your genome so BWA can map read to it:
$bwa index REL606.fna

Then, align each data set:
$bwa aln REL606.fna dataset1.fastq > datasetX.sai

And convert to SAM format (assumes single-end data): $bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam

Count reads mapping to genes

breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam

Convert alignments to BAM

And convert to BAM format (assumes single-end data):
$samtools faidx REL606.fna
$samtools import REL606.fna datasetX.sam datasetX.unsorted.bam
$samtools sort datasetX.unsorted.bam datasetX
$samtools index datasetX.bam

Now you can use IGV to view them.

Analyze differential gene expression

library(DESEQ)

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
fasta	gsaf_illumina_adapters.fasta	r1	manage	0.2 K	2012-01-30 - 16:31	JeffreyBarrick

Barrick Lab > ProtocolList > ProtocolsRNASeqDifferentialExpression

Contributors to this topic

JeffreyBarrick

Topic revision: r5 - 2012-01-30 - 16:31:50 - Main.JeffreyBarrick