Analyzing RNA-Seq Data

RNA-seq is based on next-generation sequencing (NGS) and allows for the discovery, quantitation and profiling of RNA. The technique is rapidly replacing the slightly older RNA microarray method as a way to get a more complete picture of gene expression in a cell.

Data generated by RNA-seq can illustrate variations in gene expression, identify single nucleotide polymorphisms (SNPs), profile transcription and identify new genes. Compared with microarrays, RNA-seq is better suited to following rapid changes in cellular transcriptomes and to finding post-transcriptional modifications, gene fusions, and other changes in transcripts. Modern NGS methods have made these discoveries faster.

Key Metrics in RNA-Seq

A number of key data points have been found to be valuable for interpreting RNA-seq results. These include:

Total, mapped and transcript-associated reads: Reads (sequences of cDNA fragments, often produced in the tens of millions) are mapped to the genome or transcriptome. A higher read count means deeper sequencing, which makes it easier to detect genes expressed at low levels. The percentage of mapped reads is an indicator of sequencing accuracy and can flag contaminating DNA. Transcript-associated reads reveal the existence of regulatory and expressed regions.
Aligned reads: Matching the reads to a reference sequence, such as a known genome, shows where the sample is similar to the reference and where it differs.
Strand specificity: Some library preparation approaches allow for the retention of strand-specific information so that aligned cDNA-derived reads correspond to the original mRNA.
Normalization: Methods used to remove technical biases from sequencing and improve comparability of test sequences to references. These include spike-in controls such as the Invitrogen ERCC controls, and a number of mathematical adjustments described below.
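
The read-based metrics above come down to simple ratios. The following is a minimal Python sketch, not a standard tool: the function name `read_fractions` and the example read counts are hypothetical, and in practice the three totals would come from your aligner or quantification software.

```python
def read_fractions(total_reads, mapped_reads, transcript_reads):
    """Return mapping statistics as percentages.

    total_reads: all reads in the sample
    mapped_reads: reads that aligned to the reference
    transcript_reads: mapped reads that overlap annotated transcripts
    (All three inputs are assumed to be plain read counts.)
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= transcript_reads <= mapped_reads <= total_reads:
        raise ValueError("expected transcript_reads <= mapped_reads <= total_reads")
    mapped_pct = 100.0 * mapped_reads / total_reads
    # Guard against division by zero when nothing mapped at all.
    transcript_pct = 100.0 * transcript_reads / mapped_reads if mapped_reads else 0.0
    return {"mapped_pct": mapped_pct, "transcript_pct_of_mapped": transcript_pct}


# Hypothetical sample: 40 million reads, 36 million mapped, 27 million in transcripts.
stats = read_fractions(40_000_000, 36_000_000, 27_000_000)
print(stats)
```

A low mapped percentage, or a low transcript-associated share of the mapped reads, is a prompt to check for contamination or library problems before going further.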

Tools for RNA-Seq Data Analysis

Methods for evaluating how RNA-based mechanisms affect gene regulation, disease and phenotypic variation include comparisons to sequences collected by the ENCODE Consortium, an international collaboration of scientists funded by the US National Human Genome Research Institute (NHGRI). Data can also be compared to reference transcriptomes, which are growing rapidly in number and variety and are largely available online. Other analysis software, such as the Partek Genomics Suite, analyzes microarray, qPCR, and pre-processed NGS data from a desktop computer. The Galaxy Project community hub posts a course adapted from Weill Cornell Medical Center on how to use these analytical tools.

Spike-In Controls

A mistaken assumption in sequencing is that all RNA yields are equal. Cells from different experimental conditions, however, do not yield identical amounts of DNA and RNA, reducing comparability of sequences.

Spike-in controls must be added in proportion to the number of cells for data normalization, allowing true increases (or decreases) in signal to be interpreted accurately. The Invitrogen ERCC (External RNA Controls Consortium) spike-in control mix provides a blend of synthetic transcripts that mimic the lengths of natural eukaryotic mRNAs.
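
One simple way to use spike-in reads is to rescale each sample so that its spike-in total matches the average across samples. The Python sketch below illustrates that idea only; the function names, sample names and counts are hypothetical, and it assumes the same amount of spike-in was added per cell in every sample. Dedicated normalization packages use more sophisticated models.

```python
def spike_in_scale_factors(spike_totals):
    """Compute one scale factor per sample from spike-in read totals.

    spike_totals: dict mapping sample name -> reads assigned to spike-in
    transcripts. Assumes equal spike-in input per cell in every sample.
    """
    if any(total <= 0 for total in spike_totals.values()):
        raise ValueError("every sample needs a positive spike-in read total")
    mean_total = sum(spike_totals.values()) / len(spike_totals)
    # A sample whose spike-ins received fewer reads is scaled up, and vice versa.
    return {sample: mean_total / total for sample, total in spike_totals.items()}


def apply_scale_factors(gene_counts, factors):
    """Multiply each sample's gene counts by that sample's scale factor."""
    return {
        sample: {gene: count * factors[sample] for gene, count in genes.items()}
        for sample, genes in gene_counts.items()
    }


# Hypothetical data: the treated cells yielded more RNA, so the fixed amount
# of spike-in takes up a smaller share of the treated library.
spike_totals = {"control": 1000, "treated": 500}
gene_counts = {"control": {"geneA": 200}, "treated": {"geneA": 200}}

factors = spike_in_scale_factors(spike_totals)
normalized = apply_scale_factors(gene_counts, factors)
print(factors)
print(normalized)
```

In this made-up example the raw counts for geneA are identical in both samples, but after rescaling the treated sample shows twice the signal of the control, which is the kind of true increase that equal-yield assumptions would hide.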

The more abundant a transcript is, the more likely fragments from it are to be sequenced. But counts need to be normalized so they can be compared with other transcripts, samples and experiments. A number of mathematical adjustments make this possible:

RPKM: Reads Per Kilobase of transcript per Million mapped reads. This corrects for both transcript length (longer isoforms produce more reads) and sequencing depth. The read count for a transcript is divided by the transcript length in kilobases, and the result is divided by the total number of mapped reads in millions.
FPKM: Fragments Per Kilobase of transcript per Million mapped fragments. This is similar to RPKM, but is designed for paired-end data, where two reads can map to one fragment. The fragment is counted once rather than twice.
TPM: Transcripts Per Million. This makes it easier to compare RNA-seq data between samples, such as two different tissues. Read counts are first divided by transcript length to give reads per kilobase (RPK), and each RPK is then divided by the sum of all RPKs in the sample, expressed in millions. Because TPM values add up to the same total in every sample, the share of the transcript pool assigned to a given isoform can be compared directly between tissues.
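
The adjustments above can be written out in a few lines of Python. This is a minimal sketch using the common definitions of RPKM and TPM; the function names, gene names, counts and lengths are made up for illustration. FPKM follows the same arithmetic as RPKM with fragment counts in place of read counts.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_millions = sum(counts.values()) / 1e6
    return {
        gene: counts[gene] / (lengths_bp[gene] / 1e3) / total_millions
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1e3) for gene in counts}
    scale = sum(rpk.values()) / 1e6
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical sample: read counts rise with transcript length, so all three
# transcripts have the same reads-per-kilobase value.
counts = {"A": 100, "B": 200, "C": 700}
lengths_bp = {"A": 1000, "B": 2000, "C": 7000}

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values)
print(tpm_values)
print(sum(tpm_values.values()))  # TPM values sum to about one million
```

Here transcript C has seven times the raw reads of transcript A, yet both end up with the same RPKM and the same TPM because C is also seven times longer. The sum of the TPM values is the same for any sample, which is what makes TPM convenient for comparing samples.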

Analyzing Stop Sites

Identifying transcription stop sites and polyadenylation (poly(A)) sites can often require a special type of sequencing. Poly(A) tails are important because they are part of the process leading to transcription stops and the creation of mature mRNA. They are often added to the 3′ end of RNA in eukaryotic cells, where they stabilize the RNA and make translation more efficient. The Invitrogen™ Collibri™ Stranded RNA Library Preparation kit for Illumina™ systems can help to sequence the poly(A) tail and identify these sites and alternative polyadenylation more easily, without additional sequencing steps.

RNA-Seq Provides New Avenues for Research

RNA-seq is quickly improving our understanding of the complexities of gene expression. These complexities may help researchers develop new ways to diagnose and treat cancer and a host of other diseases, and find genetic solutions in applications ranging from agriculture to health to industry. But much of this complexity invites the risk of observational bias, including assumptions about RNA yields, rates of expression, and the comparability of reads. Fortunately, many tools are available that can help normalize RNA-seq data and support meaningful conclusions across different experimental conditions.

You can find out more by downloading our free infographic.