RNA-seq is a technique used to detect differentially expressed genes (DEGs). It uses Next-Generation Sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample, giving a snapshot of the cellular transcriptome at a given moment.
It is a highly sensitive and accurate tool for measuring expression across the transcriptome.
It can detect previously unobserved changes occurring in disease states, in response to therapeutics, under different environmental conditions, and across a broad range of other study designs.
It detects both known and novel features in a single assay, enabling the identification of transcript isoforms, gene fusions, single nucleotide variants (SNVs), and other features without being limited by prior knowledge.
This report outlines the essential steps in analyzing gene expression data from RNA sequencing (mRNA) and recommends commonly used tools and techniques for this purpose. Differential expression is assessed between two experimental conditions, diseased and normal, across 12 human samples. The focus is mainly on paired-end reads. Here, we document RNA-seq data analysis following guidelines developed by H3ABioNet.
The flow chart below outlines the steps we followed in our analysis.
The reads were made available as part of the H3ABioNet guidelines; we downloaded the reads and metadata from there.
This step is important for obtaining good quality reads for downstream processes. Here we used the FastQC tool, which provides a simple way to run quality control checks on raw sequence data coming from high-throughput sequencing platforms.
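As a minimal sketch, FastQC can be run on each pair of raw FASTQ files as follows (the sample file names are placeholders, not the actual dataset names):

```bash
# Run FastQC on the raw paired-end reads; HTML reports are written to qc_raw/
# (sample37_R1.fastq.gz and sample37_R2.fastq.gz are placeholder file names)
mkdir -p qc_raw
fastqc -o qc_raw sample37_R1.fastq.gz sample37_R2.fastq.gz
```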
FastQC revealed that the raw reads were of low quality and needed to be preprocessed.
This cleans up the data and reduces noise in the overall analysis. It involves:
Removal of adaptors - Adaptors are artificial pieces of DNA introduced prior to sequencing to ensure that the DNA fragment being sequenced attaches to the sequencing flow cell. Usually these adaptors get sequenced and have already been removed from the reads. But sometimes bits of adaptors are left behind, anywhere from 20% to 90% of the adaptor length, and these need to be removed from the reads. The adaptor sequence for this step has to be obtained from the same source as the sequence data.
Removal of low quality reads - The quality of the bases sequenced tends to drop off toward one end of the read. A low quality base call means that the assigned nucleotide has a higher probability of being incorrect (the Phred score is Q = -10 log10 P, so Q25 corresponds to roughly a 0.3% error probability). We used a quality cutoff of 25: any base with a quality score below 25 was dropped.
Removal of very short reads - Once the adaptor remnants and low quality ends have been trimmed, some reads may end up being very short. Short reads are likely to align to multiple (wrong) locations on the reference, introducing noise, so any read shorter than a predetermined cutoff (we used 20 bp) needs to be removed.
Alternative tools: Trim Galore, Cutadapt
All three trimming tools we tested worked well, and each has its own advantages. We found Trimmomatic the best for our case: it offers many options that give fine control over trimming, including trimming sequences at both ends, which can be a challenge with Trim Galore. Both Trimmomatic and Trim Galore can detect the presence of adapter sequence, whereas Cutadapt can be challenging if you do not know the adapter sequence. A sketch of a Trimmomatic run with our settings is shown below.
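A minimal paired-end Trimmomatic invocation consistent with the settings described above; the adapter file and sample names are placeholders, and depending on the installation the tool may need to be launched as java -jar trimmomatic.jar instead:

```bash
# Paired-end trimming (placeholder file names).
# ILLUMINACLIP removes adaptor remnants, LEADING/TRAILING drop bases below Q25,
# and MINLEN discards reads shorter than 20 bp after trimming.
trimmomatic PE -threads 4 \
    sample37_R1.fastq.gz sample37_R2.fastq.gz \
    sample37_R1.paired.fq.gz sample37_R1.unpaired.fq.gz \
    sample37_R2.paired.fq.gz sample37_R2.unpaired.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:20
```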
Here we compare the time taken to run each tool.
After trimming, it is good to confirm that your dataset has improved by rerunning FastQC on the trimmed data and comparing the reports for the trimmed and raw FASTQ files.
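For example (again with placeholder file names):

```bash
# Re-run FastQC on the trimmed reads so the reports can be compared
# side by side with those generated from the raw data.
mkdir -p qc_trimmed
fastqc -o qc_trimmed sample37_R1.paired.fq.gz sample37_R2.paired.fq.gz
```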
The number of surviving reads varied across samples after trimming.
Below we compare the quality of reads before and after trimming.
(a) Quality of sample37 before trimming
(b) Quality of sample37 after trimming with Trimmomatic
(c) Quality of sample37 after trimming with Trim Galore
(d) Quality of sample37 after trimming with Cutadapt
Trimmomatic was the best in our case. It removes remnants of adapter sequence left after sequencing, drops reads shorter than a specified minimum length, and provides a sequence cropping option.
The figure below shows the number of reads before trimming and the percentage retained after trimming.
This phase entails:
Alignment of reads.
Obtaining transcript/gene counts.
Collecting and tabulating statistics.
Tools used include:
Kallisto.
Salmon.
HISAT2.
There are two approaches used in this step:
Alignment based approach.
Pseudo-alignment based approach.
Kallisto and Salmon are based on the idea of pseudo-alignment: rapidly determining the compatibility of reads with target transcripts without the need for full alignment. Pseudo-alignment preserves the key information needed for quantification, and these methods are fast compared with alignment-based methods. A sketch of both quantification runs follows.
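A minimal sketch of index building and quantification for one paired-end sample; transcripts.fa and the sample file names are placeholders:

```bash
# Build the transcriptome indices once (transcripts.fa is a placeholder).
kallisto index -i kallisto.idx transcripts.fa
salmon index -t transcripts.fa -i salmon_index

# Quantify one paired-end sample with each tool; abundances are written
# to the per-sample output directories.
kallisto quant -i kallisto.idx -o kallisto_out/sample37 \
    sample37_R1.paired.fq.gz sample37_R2.paired.fq.gz
salmon quant -i salmon_index -l A \
    -1 sample37_R1.paired.fq.gz -2 sample37_R2.paired.fq.gz \
    -p 4 -o salmon_out/sample37
```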
HISAT2 uses an alignment-based approach, which involves first aligning the reads to the reference genome and then generating the read counts, as in the sketch below.
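A minimal sketch of the alignment step, assuming a reference genome FASTA (genome.fa, a placeholder) and samtools for sorting:

```bash
# Build the genome index, align one sample, and sort the alignments.
hisat2-build genome.fa genome_index
hisat2 -p 4 -x genome_index \
    -1 sample37_R1.paired.fq.gz -2 sample37_R2.paired.fq.gz |
    samtools sort -o sample37.sorted.bam -
samtools index sample37.sorted.bam
```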
The figure below shows the percentage of reads that were successfully aligned.
featureCounts from the Subread package was used to obtain counts from the HISAT2 alignments. The figure below shows the summary of read counts from HISAT2.
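A minimal featureCounts sketch; annotation.gtf is a placeholder, and note that newer Subread releases also require --countReadPairs alongside -p for fragment counting:

```bash
# Count fragments (read pairs) per gene from the sorted HISAT2 alignments.
featureCounts -p -T 4 -a annotation.gtf -o gene_counts.txt sample37.sorted.bam
```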
Disk usage for each tool.
The run time for each tool is shown below.
We used DESeq2 for normalization, statistical analysis and visualization. DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.
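A minimal sketch of the DESeq2 step, run through Rscript; counts.tsv (a gene-by-sample count matrix) and samples.tsv (sample metadata with a condition column) are placeholder files:

```bash
Rscript - <<'EOF'
library(DESeq2)

# Placeholder inputs: gene x sample count matrix and sample metadata
# with a "condition" column (diseased vs normal).
counts  <- as.matrix(read.table("counts.tsv", header = TRUE, row.names = 1))
coldata <- read.table("samples.tsv", header = TRUE, row.names = 1)
coldata$condition <- factor(coldata$condition)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)                  # size factors, dispersions, NB GLM test
res <- results(dds, alpha = 0.05)  # differential expression results
summary(res)
write.csv(as.data.frame(res), "deseq2_results.csv")
saveRDS(dds, "dds.rds")            # reused for the transformations below
EOF
```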
Significant expression counts were as shown below.
We used two methods to normalize the counts:
- Variance Stabilizing Transformation (VST).
- Regularized Log transformation (RLD).
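Both transformations are provided by DESeq2 itself; a short sketch continuing from the object saved in the previous step:

```bash
Rscript - <<'EOF'
library(DESeq2)
dds <- readRDS("dds.rds")          # DESeqDataSet saved in the previous sketch
vsd <- vst(dds, blind = FALSE)     # variance stabilizing transformation
rld <- rlog(dds, blind = FALSE)    # regularized log transformation
write.csv(assay(vsd), "counts_vst.csv")
write.csv(assay(rld), "counts_rld.csv")
EOF
```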
Below we compare the effect of each normalization method for each tool.
After VST normalization.
After RLD normalization.
We selected the following tools for our final pipeline.