Quantification and Differential Expression of RNAseq with salmon

Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis.

Salmon preprint: http://biorxiv.org/content/early/2015/06/27/021592

Salmon Web site: https://combine-lab.github.io/salmon/

Intro blog post: http://robpatro.com/blog/?p=248

A (very recent) blog post evaluating and comparing methods: https://cgatoxford.wordpress.com/2016/08/17/why-you-should-stop-using-featurecounts-htseq-or-cufflinks2-and-start-using-kallisto-salmon-or-sailfish/

Installation

Download and unpack salmon, and add it to your path:

cd
curl -L -O https://github.com/COMBINE-lab/salmon/releases/download/v0.7.0/Salmon-0.7.0_linux_x86_64.tar.gz
tar xzf Salmon-0.7.0_linux_x86_64.tar.gz
export PATH=$PATH:$HOME/SalmonBeta-0.7.0_linux_x86_64/bin

Getting the data

Do:

sudo chmod a+rwxt /mnt
mkdir /mnt/data
cd /mnt/data/
curl -O https://s3.amazonaws.com/public.ged.msu.edu/nema-subset.tar.gz
tar xzf nema-subset.tar.gz

(This is data from Tulin et al., 2013 that was processed and assembled with the khmer protocols steps 1-3 – basically, what we did in Short read quality and trimming, but for the entire data set.)

Make a directory to work in:

mkdir /mnt/quant

Copy in the transcriptome from the snapshot:

cd /mnt/quant
cp /mnt/data/nema.fa .

Index it with salmon:

salmon index --index nema_index --transcripts nema.fa --type quasi

Link the reads in that we downloaded:

ln -fs /mnt/data/*.fq .

Now, quantify the reads against the reference using Salmon:

for i in *.1.fq
do
   BASE=$(basename $i .1.fq)
   salmon quant -i nema_index --libType IU \
          -1 $BASE.1.fq -2 $BASE.2.fq -o $BASE.quant;
done

(Note that --libType must come before the read files!)

This will create a bunch of directories named something like 0Hour_ATCACG_L002001.quant, containing a bunch of files. Take a look at what files there are:

find 0Hour_ATCACG_L002001.quant -type f

You should see:

0Hour_ATCACG_L002001.quant/lib_format_counts.json
0Hour_ATCACG_L002001.quant/cmd_info.json
0Hour_ATCACG_L002001.quant/libParams/flenDist.txt
0Hour_ATCACG_L002001.quant/aux_info/observed_bias.gz
0Hour_ATCACG_L002001.quant/aux_info/observed_bias_3p.gz
0Hour_ATCACG_L002001.quant/aux_info/expected_bias.gz
0Hour_ATCACG_L002001.quant/aux_info/fld.gz
0Hour_ATCACG_L002001.quant/aux_info/meta_info.json
0Hour_ATCACG_L002001.quant/logs/salmon_quant.log
0Hour_ATCACG_L002001.quant/quant.sf

The two most interesting files are lib_format_counts.json and quant.sf. The latter contains the counts; the former contains the log information from running things. Take a look at the counts metadata –

less 0Hour_ATCACG_L002001.quant/lib_format_counts.json

and see what you think it means... (Use ‘q’ to quit out of less.)

You might also be interested in the log file –

less 0Hour_ATCACG_L002001.quant/logs/salmon_quant.log

A few notes –

So, what should you pay attention to here? Let’s list them out...

Working with the counts

Now, the quant.sf files actually contain the relevant information about expression – take a look:

head -20 0Hour_ATCACG_L002001.quant/quant.sf

The first column contains the transcript names, and the fifth column is what edgeR etc will want - the “raw counts”. However, they’re not in a convenient location / format for edgeR to use; let’s fix that.

Download the gather-counts.py script:

curl -L -O https://github.com/ngs-docs/2016-aug-nonmodel-rnaseq/raw/master/files/gather-counts.py

and run it:

python ./gather-counts.py

This will give you a bunch of .counts files, processed from the quant.sf files and named for the directory they are in.

Now, run an edgeR script (nema.salmon.R) that loads all this in and calculates a few plots –

curl -O -L https://raw.githubusercontent.com/ngs-docs/2016-aug-nonmodel-rnaseq/master/files/nema.salmon.R
Rscript nema.salmon.R

These will produce two plots, nema-edgeR-MDS.pdf and nema-edgeR-MA-plot.pdf.


You can see the plot outputs for the whole data set (all the reads) here:

A challenge exercise

How would we create an MA plot comparing 6 Hour vs 12 Hour?


2016 / Aug / mRNAseq on non-model organisms


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.