RNA sequencing (‘RNA-seq’) has become one of the most widely used applications for Next-Generation Sequencing. RNA-seq can provide gene expression data more cheaply than microarray, at greater sensitivity, and without the biases inherent in an assay based on quantifying nucleic acid hybridization. RNA-seq can also provide data about alternative splicing, allele-specific expression, expression of non-annotated exons (or genes), and detection of fusion genes in cancer.
The standard quality control metrics for all NGS runs, which can be derived from the runtime software and from FASTQC include: total number of reads, ‘Pass Filter’ reads, percentage of reads with average quality >Q30. There are a number of quality metrics specific for RNA-seq that can only be obtained after alignment to an annotated reference genome.
One key measurement is the amount of ribosomal RNA (rRNA). This is important since the removal of rRNA by various RNA sample prep protocols is quite inconsistent. If one sample contains 60% rRNA and another contains 5%, then a direct comparison of gene expression between these two samples will be very inaccurate. rRNA reads are automatically captured and filtered out by the ‘contam’ filter in the Illumina CASAVA software during alignment of RNA-seq reads to a reference genome.
Aligning with a bowtie
The popular (open source) aligners Bowtie and BWA do not include this feature, so it is necessary to create a separate filtering sequence of rRNA for your species (you might also include mitochondrial DNA in this filtering data set), then align all reads to this filter set, and write a new output file of ‘unaligned’ reads (–un command option in Bowtie). The unaligned reads become the filtered data set, which can be used to count gene expression. The important statistic is the percentage of reads in each sample which align to the rRNA filter. The total number of reads in the filtered data set (not the original FASTQ file) should be used for per-sample normalization by RPKM or other methods.
The Picard Toolkit
The Picard toolkit (a Java implementation and extension of SAMtools) includes a measurement of percent rRNA as part of its ‘CollectRnaSeqMetrics’ command. This requires aligned BAM files and a genome annotation file as input and the rRNA genes must be defined as a set of intervals (locations) on the same reference genome that is used for alignment.
It is also possible to get a rough idea of rRNA content from FASTQC (without alignment of the FASTQ files) by looking in the ‘Overrepresented Sequences’ section. Duplicated sequences that make up more than 0.1% of your total data set are matched against a database of common contaminants, which includes rDNA sequences. Other unexpected problems may show up here as well, such as contamination of samples with viral RNA.
Some RNA-seq sample prep methods use poly-A capture, or RT-PCR with poly-T primers, creating substantially more reads from the 3’ ends of transcripts. The extent of this 3’ bias is not consistent across samples or across genes. The amount of 3’ bias can be computed as a single number by the Picard ‘RNAseqMetrics’ as Meidan-3Prime-Bias as follows:
MEDIAN_3PRIME_BIAS: The median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per transcript as: mean coverage of the 3′ most 100 bases divided by the mean coverage of the whole transcript.
Picard also produces a nice PDF output of coverage vs. position on transcripts (all transcripts are scaled to a length of 100) that is probably more easily understood by wet lab investigators.
A new software package called provides many of the same metrics as FASTQC and Picard as well as some additional features such as RPKM coverage of the top 25% most highly expressed genes and annotation status of observed splice junctions. It can also detect insert size for paired-end reads and calculate RPKM counts (normalized expression value) for each exon, intron, and transcript as defined by gene models in the annotation file. RSeQC is a set of Python scripts, which requires aligned data (BAM files) and a genome annotation file (in BED format) as input.
Alternative splicing of RNA is a huge topic. From a quality control standpoint, we are primarily interested to detect genomic contamination in RNA-seq samples. This can be detected as relatively high numbers of reads that align to annotated introns. Picard provides measures of %intronic and %exon bases. RSeQC provides a nice table of coverage in coding, UTR, and intron regions.
|Group||Total bases||Tag count||Tags/Kb|
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25: 2078-9. PubMed PMID: 19505943