NGS Quality Control in RNA Sequencing- Some Free Tools

RNA sequencing (‘RNA-seq’) has become one of the most widely used applications for Next-Generation Sequencing. RNA-seq can provide gene expression data more cheaply than microarray, at greater sensitivity, and without the biases inherent in an assay based on quantifying nucleic acid hybridization. RNA-seq can also provide data about alternative splicing, allele-specific expression, expression of non-annotated exons (or genes), and detection of fusion genes in cancer.

Quality Metrics

The standard quality control metrics for all NGS runs, which can be derived from the runtime software and from FASTQC include: total number of reads, ‘Pass Filter’ reads, percentage of reads with average quality >Q30. There are a number of quality metrics specific for RNA-seq that can only be obtained after alignment to an annotated reference genome.

rRNA

One key measurement is the amount of ribosomal RNA (rRNA). This is important since the removal of rRNA by various RNA sample prep protocols is quite inconsistent. If one sample contains 60% rRNA and another contains 5%, then a direct comparison of gene expression between these two samples will be very inaccurate. rRNA reads are automatically captured and filtered out by the ‘contam’ filter in the Illumina CASAVA software during alignment of RNA-seq reads to a reference genome.

Aligning with a bowtie

The popular (open source) aligners Bowtie and BWA do not include this feature, so it is necessary to create a separate filtering sequence of rRNA for your species (you might also include mitochondrial DNA in this filtering data set), then align all reads to this filter set, and write a new output file of ‘unaligned’ reads (–un command option in Bowtie). The unaligned reads become the filtered data set, which can be used to count gene expression. The important statistic is the percentage of reads in each sample which align to the rRNA filter. The total number of reads in the filtered data set (not the original FASTQ file) should be used for per-sample normalization by RPKM or other methods.

Image Larger Volumes with the UltraMicroscope Choros™

From: Miltenyi Biotech

Trust Your Quantification with the DeNovix DS-8X Rapid Eight Channel, 1µL UV-Vis Spectrophotometer

From: DeNovix

The Picard Toolkit

The Picard toolkit (a Java implementation and extension of SAMtools) includes a measurement of percent rRNA as part of its ‘CollectRnaSeqMetrics’ command. This requires aligned BAM files and a genome annotation file as input and the rRNA genes must be defined as a set of intervals (locations) on the same reference genome that is used for alignment.

Unexpected problems

It is also possible to get a rough idea of rRNA content from FASTQC (without alignment of the FASTQ files) by looking in the ‘Overrepresented Sequences’ section. Duplicated sequences that make up more than 0.1% of your total data set are matched against a database of common contaminants, which includes rDNA sequences. Other unexpected problems may show up here as well, such as contamination of samples with viral RNA.

3’ bias

Some RNA-seq sample prep methods use poly-A capture, or RT-PCR with poly-T primers, creating substantially more reads from the 3’ ends of transcripts. The extent of this 3’ bias is not consistent across samples or across genes. The amount of 3’ bias can be computed as a single number by the Picard ‘RNAseqMetrics’ as Meidan-3Prime-Bias as follows:

MEDIAN_3PRIME_BIAS: The median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per transcript as: mean coverage of the 3′ most 100 bases divided by the mean coverage of the whole transcript.

Picard also produces a nice PDF output of coverage vs. position on transcripts (all transcripts are scaled to a length of 100) that is probably more easily understood by wet lab investigators.

NGS Quality Control in RNA Sequencing- Some Free Tools

A new software package called provides many of the same metrics as FASTQC and Picard as well as some additional features such as RPKM coverage of the top 25% most highly expressed genes and annotation status of observed splice junctions. It can also detect insert size for paired-end reads and calculate RPKM counts (normalized expression value) for each exon, intron, and transcript as defined by gene models in the annotation file. RSeQC is a set of Python scripts, which requires aligned data (BAM files) and a genome annotation file (in BED format) as input.

Intron Splicing

Alternative splicing of RNA is a huge topic. From a quality control standpoint, we are primarily interested to detect genomic contamination in RNA-seq samples. This can be detected as relatively high numbers of reads that align to annotated introns. Picard provides measures of %intronic and %exon bases. RSeQC provides a nice table of coverage in coding, UTR, and intron regions.

Group	Total bases	Tag count	Tags/Kb
CDS Exons	33302033	2000227	600.63
5’UTR Exons	21717577	4408991	203.01
3’UTR Exons	15347845	3643326	237.38
Introns	1132597354	6325392	5.58

Tools

BOWTIE

https://bowtie-bio.sourceforge.net/index.shtml

BWA

https://bio-bwa.sourceforge.net

CASAVA

https://support.illumina.com/sequencing/sequencing_software/casava.ilmn

FASTQC

https://www.bioinformatics.babraham.ac.uk/projects/fastqc

Picard

https://picard.sourceforge.net

References

SAMtools

Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25: 2078-9. PubMed PMID: 19505943

RSeQC

Wang L, Wang S, Li W. (2012) RSeQC: quality control of RNA-seq experiments.Bioinformatics, 28: 2184-5. PubMed PMID:22743226

Stuart M. Brown

He received his PhD in Molecular Biology from Cornell University.

About Us

Marketing

NGS Quality Control in RNA Sequencing- Some Free Tools

Quality Metrics

rRNA

Aligning with a bowtie

Image Larger Volumes with the UltraMicroscope Choros™

Trust Your Quantification with the DeNovix DS-8X Rapid Eight Channel, 1µL UV-Vis Spectrophotometer

The Picard Toolkit

Unexpected problems

3’ bias

Intron Splicing

Tools

References

Top Resources for Learning a NGS Programming Language

Codon Optimization 101

What To Do When Whole Genome Sequencing Yields Unexpected Results: Is There A One-Size-Fits-All Answer?

GeneDig: An Easy-to-use Genome Browser for Bioscientists

Introduction to DNA Microarrays

An Introduction To ChIP-seq

Are You Leaning On a Crutch? Spotting Habits and Routines That Hold You Back

10 Things Every Molecular Biologist Should Know

About Us

Marketing

NGS Quality Control in RNA Sequencing- Some Free Tools

Quality Metrics

rRNA

Aligning with a bowtie

The Picard Toolkit

Unexpected problems

3’ bias

Intron Splicing

Tools

References

More 'Genomics and Epigenetics' articles

Are You Leaning On a Crutch? Spotting Habits and Routines That Hold You Back

10 Things Every Molecular Biologist Should Know