De novo transcriptome assembly with Trinity

This tutorial uses a small subset of mRNAseq reads from Nematostella vectensis (Tulin et al., 2013).

The original RNAseq workflow protocol is here; a more recent protocol is here.

Installation

On a Jetstream instance, run the following commands to update the package index and install the base software:

sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
  default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \
  python-matplotlib python-pip python-virtualenv sysstat fastqc \
  trimmomatic bowtie samtools blast2 wget bowtie2 openjdk-8-jre \
  hmmer ruby

Install Trinity:

cd ${HOME}

wget https://github.com/trinityrnaseq/trinityrnaseq/archive/Trinity-v2.3.2.tar.gz \
    -O trinity.tar.gz
tar xzf trinity.tar.gz
cd trinityrnaseq*/
make |& tee trinity-build.log

If the build succeeds, add the Trinity directory to your PATH:

echo "export PATH=\$PATH:$(pwd)" >> ~/.bashrc
source ~/.bashrc
cd
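
To confirm that the PATH change took effect, a quick check along these lines may help. This is a minimal sketch: check_on_path is a hypothetical helper name used only for illustration and is not part of Trinity.

```shell
# Hypothetical helper: report whether a program can be found via PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 is not on PATH" >&2
        return 1
    fi
}

# If this fails, re-check the export line that was appended to ~/.bashrc.
check_on_path Trinity || echo "Trinity not found; re-check ~/.bashrc and re-run 'source ~/.bashrc'"
```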

You will also need to set the default Java version to 1.8:

sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

Next, let’s check that we still have our reads from the QC lesson:

set -u
printf "\nMy trimmed data is in $PROJECT/quality/, and consists of $(ls -1 ${PROJECT}/quality/*.qc.fq.gz | wc -l) files\n\n"
set +u

Here, set -u makes the shell report an error whenever an unset variable is used, for example if the $PROJECT variable is not defined.

If you see -bash: PROJECT: unbound variable, then you need to set the $PROJECT variable:

export PROJECT=/mnt/work

Then re-run the printf code block.
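
A slightly more defensive variant of that export keeps any value of PROJECT that is already set and only falls back to /mnt/work when it is missing. This is a sketch using standard shell parameter expansion, not a required step.

```shell
# Keep PROJECT if it is already set; otherwise fall back to /mnt/work.
export PROJECT="${PROJECT:-/mnt/work}"
echo "PROJECT is set to ${PROJECT}"
```

Because ${PROJECT:-/mnt/work} supplies a default, this line is safe to run even while set -u is active.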

NOTE: If no files are listed, please rerun the quality trimming steps here.

Running the Assembly!

Let’s make another working directory for the assembly:

cd ${PROJECT}
mkdir -p assembly
cd assembly

For paired-end data, Trinity expects two input files, ‘left’ and ‘right’. Concatenate all R1 files into left.fq and all R2 files into right.fq:

zcat ${PROJECT}/quality/*R1*.qc.fq.gz > ${PROJECT}/assembly/left.fq
zcat ${PROJECT}/quality/*R2*.qc.fq.gz > ${PROJECT}/assembly/right.fq
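
Before starting the assembly, it can be worth checking that the two files hold the same number of reads. The sketch below assumes standard four-line FASTQ records; count_reads is a hypothetical helper name, and the /mnt/work fallback is only used if PROJECT is unset.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file,
# assuming exactly four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

left="${PROJECT:-/mnt/work}/assembly/left.fq"
right="${PROJECT:-/mnt/work}/assembly/right.fq"

# Only report counts when both files are present.
if [ -f "$left" ] && [ -f "$right" ]; then
    echo "left:  $(count_reads "$left") reads"
    echo "right: $(count_reads "$right") reads"
fi
```

If the two counts differ, the files are probably not properly paired, and it is worth revisiting the trimming step before assembling.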

Assembling with Trinity

Here is the assembly command!

cd ${PROJECT}/assembly
Trinity --left left.fq \
  --right right.fq --seqType fq --max_memory 14G \
  --CPU 2

Note that the last two options, --max_memory 14G and --CPU 2, set the maximum amount of memory and the number of CPUs for Trinity to use. Increase or decrease them to match the machine you are running on.
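
One way to pick these values is to look at what the machine actually has. The following is a rough sketch for Linux: suggest_max_memory is a hypothetical helper, and leaving 2 GB of headroom is an arbitrary illustrative choice rather than a Trinity recommendation.

```shell
# Hypothetical helper: suggest a --max_memory value from the total RAM in GB.
# The second argument is the headroom to leave free (default 2 GB, an
# arbitrary illustrative choice).
suggest_max_memory() {
    ram_gb=$1
    headroom_gb=${2:-2}
    mem=$(( ram_gb - headroom_gb ))
    if [ "$mem" -lt 1 ]; then
        mem=1
    fi
    echo "${mem}G"
}

# On Linux, read total RAM from /proc/meminfo and the core count from nproc.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    detected_gb=$(awk '/^MemTotal:/ {print int($2 / 1048576)}' /proc/meminfo)
    echo "Suggested flags: --max_memory $(suggest_max_memory "$detected_gb") --CPU $(nproc)"
fi
```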

Once this completes, you’ll have an assembled transcriptome in ${PROJECT}/assembly/trinity_out_dir/Trinity.fasta.
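
As a quick sanity check on the result, you can count the assembled sequences. This sketch simply counts FASTA header lines; count_transcripts is a hypothetical helper name, and the /mnt/work fallback is only used if PROJECT is unset.

```shell
# Hypothetical helper: count sequences in a FASTA file by counting
# header lines, which start with '>'.
count_transcripts() {
    grep -c '^>' "$1"
}

fasta="${PROJECT:-/mnt/work}/assembly/trinity_out_dir/Trinity.fasta"

# Only report once the assembly output exists.
if [ -f "$fasta" ]; then
    echo "Assembled transcripts: $(count_transcripts "$fasta")"
fi
```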