Assembly statistics and evaluation¶

Note

If you are starting at this point, you’ll need a copy of the assembly we just performed (Running the actual assembly). You can set that up by doing:

sudo chmod a+rwxt /mnt
cd /mnt
mkdir work
cd work
curl -L -O https://github.com/ngs-docs/2016-aug-nonmodel-rnaseq/raw/master/files/assembly-and-reads.tar.gz
tar xzf assembly-and-reads.tar.gz

So, we now have an assembly of our reads in rna-assembly.fa. Let’s take a look at this file –

head rna-assembly.fa

This is a FASTA file with complex (and not, on the face of it, very informative!) sequence headers, and a bunch of sequences. There are three things you might want to do with this assembly - check its quality, annotate it, and search it. Below we’re going to check its quality; other workshops do (will) cover annotation and search.

Applying transrate¶

transrate is a program for assessing RNAseq assemblies that will give you a bunch of assembly statistics, along with a few other outputs.

First, let’s download and install it:

cd
curl -O -L https://bintray.com/artifact/download/blahah/generic/transrate-1.0.3-linux-x86_64.tar.gz
tar xzf transrate-1.0.3-linux-x86_64.tar.gz
export PATH=~/transrate-1.0.3-linux-x86_64:$PATH

Now run transrate on the assembly to get some preliminary stats:

cd /mnt/work
transrate --assembly=rna-assembly.fa --output=stats

This should give you output like this:

[ INFO] 2016-08-20 10:25:51 : n seqs                           60
[ INFO] 2016-08-20 10:25:51 : smallest                        206
[ INFO] 2016-08-20 10:25:51 : largest                        4441
[ INFO] 2016-08-20 10:25:51 : n bases                       73627
[ INFO] 2016-08-20 10:25:51 : mean len                    1227.12
[ INFO] 2016-08-20 10:25:51 : n under 200                       0
[ INFO] 2016-08-20 10:25:51 : n over 1k                        34
[ INFO] 2016-08-20 10:25:51 : n over 10k                        0
[ INFO] 2016-08-20 10:25:51 : n with orf                       38
[ INFO] 2016-08-20 10:25:51 : mean orf percent              67.01
[ INFO] 2016-08-20 10:25:51 : n90                             816
[ INFO] 2016-08-20 10:25:51 : n70                            1562
[ INFO] 2016-08-20 10:25:51 : n50                            1573
[ INFO] 2016-08-20 10:25:51 : n30                            2017
[ INFO] 2016-08-20 10:25:51 : n10                            3417
[ INFO] 2016-08-20 10:25:51 : gc                             0.47
[ INFO] 2016-08-20 10:25:51 : bases n                           0
[ INFO] 2016-08-20 10:25:51 : proportion n                    0.0

...which is pretty useful basic stats.

I’d suggest paying attention to:

n seqs
largest
mean orf percent

and more or less ignoring the rest on a first pass.

Note: don’t use n50 to characterize your transcriptome, as with transcripts you are not necessarily aiming for the longest contigs, and isoforms will mess up your statistics in any case.

Evaluating read mapping¶

You can also use transrate to assess read mapping; this will use read evidence to detect many kinds of errors.

To do this, you have to supply transrate with the reads:

transrate --assembly=rna-assembly.fa --left=left.fq --right=right.fq --output=transrate_reads

The relevant output is here:

[ INFO] 2016-08-20 10:27:47 : Read mapping metrics:
[ INFO] 2016-08-20 10:27:47 : -----------------------------------
[ INFO] 2016-08-20 10:27:47 : fragments                     57158
[ INFO] 2016-08-20 10:27:47 : fragments mapped              50840
[ INFO] 2016-08-20 10:27:47 : p fragments mapped             0.89
[ INFO] 2016-08-20 10:27:47 : good mappings                  8276
[ INFO] 2016-08-20 10:27:47 : p good mapping                 0.14
[ INFO] 2016-08-20 10:27:47 : bad mappings                  42564
[ INFO] 2016-08-20 10:27:47 : potential bridges                18
[ INFO] 2016-08-20 10:27:47 : bases uncovered               12751
[ INFO] 2016-08-20 10:27:47 : p bases uncovered              0.17
[ INFO] 2016-08-20 10:27:47 : contigs uncovbase                47
[ INFO] 2016-08-20 10:27:47 : p contigs uncovbase            0.78
[ INFO] 2016-08-20 10:27:47 : contigs uncovered                14
[ INFO] 2016-08-20 10:27:47 : p contigs uncovered            0.23
[ INFO] 2016-08-20 10:27:47 : contigs lowcovered               25
[ INFO] 2016-08-20 10:27:47 : p contigs lowcovered           0.42
[ INFO] 2016-08-20 10:27:47 : contigs segmented                14
[ INFO] 2016-08-20 10:27:47 : p contigs segmented            0.23

Here, the percent of good mappings is probably the first number to look at - this is mappings where both members of the pair are aligned in the correct orientation on the same contig, without overlapping either end. (See transrate metrics for more documentation.)

Using transrate to compare two transcriptomes¶

transrate can also compare an assembly to a “reference”. One nice thing about this is that you can compare two assemblies...

First, install the necessary software:

transrate --install-deps ref

Second, download a different assembly – this is one we did with the same starting reads, but without using digital normalization:

curl -O -L https://github.com/ngs-docs/2016-aug-nonmodel-rnaseq/raw/master/files/rna-assembly-nodn.fa.gz
gunzip rna-assembly-nodn.fa.gz

Compare in both directions:

transrate --assembly=rna-assembly.fa --reference=rna-assembly-nodn.fa --output=assembly-compare1

and

transrate --reference=rna-assembly.fa --assembly=rna-assembly-nodn.fa --output=assembly-compare2

First results:

[ INFO] 2016-08-20 10:35:35 : Comparative metrics:
[ INFO] 2016-08-20 10:35:35 : -----------------------------------
[ INFO] 2016-08-20 10:35:35 : CRBB hits                        54
[ INFO] 2016-08-20 10:35:35 : n contigs with CRBB              54
[ INFO] 2016-08-20 10:35:35 : p contigs with CRBB             0.9
[ INFO] 2016-08-20 10:35:35 : rbh per reference               0.9
[ INFO] 2016-08-20 10:35:35 : n refs with CRBB                 32
[ INFO] 2016-08-20 10:35:35 : p refs with CRBB               0.53
[ INFO] 2016-08-20 10:35:35 : cov25                            18
[ INFO] 2016-08-20 10:35:35 : p cov25                         0.3
[ INFO] 2016-08-20 10:35:35 : cov50                            18
[ INFO] 2016-08-20 10:35:35 : p cov50                         0.3
[ INFO] 2016-08-20 10:35:35 : cov75                            18
[ INFO] 2016-08-20 10:35:35 : p cov75                         0.3
[ INFO] 2016-08-20 10:35:35 : cov85                            16
[ INFO] 2016-08-20 10:35:35 : p cov85                        0.27
[ INFO] 2016-08-20 10:35:35 : cov95                            14
[ INFO] 2016-08-20 10:35:35 : p cov95                        0.23
[ INFO] 2016-08-20 10:35:35 : reference coverage             0.24

Second results:

[ INFO] 2016-08-20 10:36:45 : Comparative metrics:
[ INFO] 2016-08-20 10:36:45 : -----------------------------------
[ INFO] 2016-08-20 10:36:45 : CRBB hits                        45
[ INFO] 2016-08-20 10:36:45 : n contigs with CRBB              45
[ INFO] 2016-08-20 10:36:45 : p contigs with CRBB            0.75
[ INFO] 2016-08-20 10:36:45 : rbh per reference              0.75
[ INFO] 2016-08-20 10:36:45 : n refs with CRBB                 31
[ INFO] 2016-08-20 10:36:45 : p refs with CRBB               0.52
[ INFO] 2016-08-20 10:36:45 : cov25                            17
[ INFO] 2016-08-20 10:36:45 : p cov25                        0.28
[ INFO] 2016-08-20 10:36:45 : cov50                            17
[ INFO] 2016-08-20 10:36:45 : p cov50                        0.28
[ INFO] 2016-08-20 10:36:45 : cov75                            16
[ INFO] 2016-08-20 10:36:45 : p cov75                        0.27
[ INFO] 2016-08-20 10:36:45 : cov85                            15
[ INFO] 2016-08-20 10:36:45 : p cov85                        0.25
[ INFO] 2016-08-20 10:36:45 : cov95                            15
[ INFO] 2016-08-20 10:36:45 : p cov95                        0.25
[ INFO] 2016-08-20 10:36:45 : reference coverage             0.14

In this case you can see that our first assembly “covers” more of the other assembly than the other assembly does ours (rbh per reference, and reference coverage). However, you can also see that the assemblies differ quite a bit (for reasons that I haven’t tracked down).

Merging two (or more) assemblies¶

Finally, you can also use transrate to merge contigs from multiple assemblies, if you’ve used read mapping –

transrate --assembly=rna-assembly.fa \
     --merge-assemblies=rna-assembly-nodn.fa \
     --left=left.fq --right=right.fq \
     --output=transrate-merge

...although for our assemblies here it doesn’t really improve them.

Back to index: 2016 / Aug / mRNAseq on non-model organisms

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.