You have spent days, if not weeks, at the bench setting up the treatment and control samples for that crucial experiment. You submitted your cDNA library for sequencing, and after a few weeks of anxious waiting you get back a list of differentially expressed genes. Hooray?! Hold on, not quite yet! There is something you need to know.
It is common practice in the analysis of RNA-seq data to assume that only a small subset of the transcripts differs between samples. This assumption is essential for eliminating technical artifacts, such as pipetting errors. It also provides us with a baseline (the ‘point zero’) against which to compare samples and look for differences.
There are cases, however, in which the samples have completely remodeled transcriptomes. This is true, for example, in cells overexpressing c-myc, as researchers at the Whitehead Institute have discovered (Loven et al.). These cells produce two or three times more RNA than their low c-myc counterparts. This important difference would be masked during analysis if the RNA-seq data from the two samples were normalized the standard way, by equating the average amplitude of the two transcriptomes. What is worse, as the researchers show, normalizing these samples in the conventional way would lead to completely wrong estimates of what is being up- or down-regulated in the experiment!
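To make the masking effect concrete, here is a minimal sketch with made-up numbers (they are not taken from Loven et al.). It assumes, for simplicity, that read counts are directly proportional to transcripts per cell, and it uses plain library-size scaling as a stand-in for "the standard way". The gene names and the helper `library_size_normalize` are hypothetical.

```python
# Toy illustration only: invented abundances, not real data.

def library_size_normalize(counts):
    """Scale counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

# Hypothetical transcripts per cell in low c-myc cells.
low_myc = {"geneA": 100.0, "geneB": 200.0, "geneC": 700.0}
# In high c-myc cells, geneA and geneB double; geneC does not change.
high_myc = {"geneA": 200.0, "geneB": 400.0, "geneC": 700.0}

norm_low = library_size_normalize(low_myc)
norm_high = library_size_normalize(high_myc)

# Apparent fold changes after equating the two library sizes.
fold_changes = {g: norm_high[g] / norm_low[g] for g in low_myc}

for gene, fc in fold_changes.items():
    print(f"{gene}: apparent fold change {fc:.2f}")
```

In this toy case the genes that truly doubled appear to go up by only about 1.5-fold, and the gene that did not change at all appears to go down. That is the kind of wrong call the paragraph above warns about.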
Ignore at your peril!
The above example is an extreme case of massive transcriptome remodeling that could hardly go unnoticed, as it also affects the morphology of the cells. But who can guarantee that smaller-scale differences across a wide range of transcripts are not more common than thought? Normalizing data from such samples in the standard way could turn perfectly good read counts into junk. You might argue that this is an improbable scenario. You might be right, but one thing is certain: with more and more transcriptomes being analyzed every day, continuing to ignore this possibility is a ticking bomb.
…and the solution
Wait, there is hope. When there is not a sufficient set of transcripts with stable expression levels, external standards (known as ‘RNA spike-ins’) can provide the baseline for comparison. They are solutions of different RNAs, each with a unique, known sequence, premixed in defined concentrations. You add them to the RNA samples right after extraction (the same amount per million cells or per microgram of RNA) and continue preparing the library according to your protocol. At the analysis stage, they provide the common reference against which to normalize the reads from your samples. The only thing to watch is the relative amount of spike-ins you add to your samples. If added in excess, the signal from the spike-ins will outshine any useful information from the sample. This is easy to avoid, as long as you are aware of it.
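As a rough sketch of what "normalizing against the spike-ins" can look like at the analysis stage, the snippet below derives one scaling factor per sample from the total spike-in reads and divides the gene counts by it. It assumes the same amount of spike-in was added per cell in both samples. The sample names, gene names, spike-in labels and both helper functions are hypothetical, and real pipelines typically use more robust estimators than a simple total.

```python
# Toy illustration only: invented read counts, not real data.

def spike_in_size_factors(spike_counts_by_sample):
    """One factor per sample: total spike-in reads relative to the mean total."""
    totals = {s: sum(c.values()) for s, c in spike_counts_by_sample.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: t / mean_total for s, t in totals.items()}

def normalize(gene_counts, size_factor):
    """Divide every gene count by the sample's spike-in size factor."""
    return {g: c / size_factor for g, c in gene_counts.items()}

# Reads mapped to the spike-in sequences (same amount added per cell).
spikes = {
    "low_myc": {"spike_1": 500, "spike_2": 500},
    "high_myc": {"spike_1": 250, "spike_2": 250},  # sequenced at half the depth
}
# Reads mapped to endogenous genes.
genes = {
    "low_myc": {"geneA": 1000, "geneB": 2000, "geneC": 7000},
    "high_myc": {"geneA": 1000, "geneB": 2000, "geneC": 3500},
}

factors = spike_in_size_factors(spikes)
normalized = {s: normalize(genes[s], factors[s]) for s in genes}

fold_changes = {
    g: normalized["high_myc"][g] / normalized["low_myc"][g]
    for g in genes["low_myc"]
}
print(fold_changes)
```

Because the spike-ins were added in equal amounts per cell, they absorb the difference in sequencing depth. In this toy data geneA and geneB come out doubled and geneC unchanged, which is what a per-cell comparison should show.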
A (not so) new method
The use of RNA spike-ins is not a new idea. Although they have been used for some time in microarray experiments, only a handful of published papers use them in the analysis of NGS data. Admittedly, RNA-seq is a relatively new technology, and it takes some time to adapt pre-existing methods to it. A big breakthrough in this direction came with the standardization of a set of RNA spike-ins specifically designed for RNA-seq experiments by the External RNA Controls Consortium (ERCC; Jiang et al.). This enabled the researchers at Whitehead to overcome their normalization problem for the c-myc high-expressing cells.
It is only a matter of time before the benefits of using external RNA spike-in controls make them a common fixture in RNA-seq experiments. I already talked about their function as an external calibrator in systems where global transcriptional responses are observed. Moreover, as they comprise mixes of different RNA species in carefully calibrated concentrations, they enable the absolute quantification of the number of transcripts per cell. They also provide a reference for assessing the quality of the run as well as for the detection of biases in the cDNA library. To be honest, I don’t understand why someone would not want to use RNA spike-in controls in his or her RNA-seq experiment!
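The absolute quantification mentioned above can be pictured as a standard curve: the spike-ins give you pairs of (known molecules added, reads observed), and a fit through those points lets you read off a molecule count for any gene. The sketch below fits a straight line in log-log space by ordinary least squares. All numbers are invented, the function names are hypothetical, and it ignores transcript length and other biases that a real analysis would need to account for.

```python
import math

# Toy illustration only: invented spike-in amounts and read counts.

def fit_log_log(known_molecules, observed_reads):
    """Least-squares line: log2(molecules) = intercept + slope * log2(reads)."""
    xs = [math.log2(r) for r in observed_reads]
    ys = [math.log2(m) for m in known_molecules]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate_molecules(reads, slope, intercept):
    """Read a molecule count off the fitted standard curve."""
    return 2 ** (intercept + slope * math.log2(reads))

# Hypothetical spike-in mix: molecules added per cell and reads recovered.
molecules_per_cell = [1, 10, 100, 1000]
spike_reads = [5, 50, 500, 5000]

slope, intercept = fit_log_log(molecules_per_cell, spike_reads)

# A gene of interest with 250 reads in the same library.
estimate = estimate_molecules(250, slope, intercept)
print(f"estimated molecules per cell: {estimate:.1f}")
```

The same fit doubles as a quality check: spike-ins that fall far from the line, or a slope far from what you expect, hint at problems in the library or the run.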
You now have the power
So, there you are: you now know. There is such a thing as too many transcriptional differences between your samples. But before you start second-guessing the validity of the list in your hands, look for other signs that your data might require a different normalization approach. Large transcriptional amplifications can be observed as reproducible differences in the amount of RNA extracted per unit of sample (e.g. the total RNA per million cells or per milligram of tissue used). Try to remember any noticeable changes in the size or morphology of the cells. And of course, validating some of the listed differences is never a bad idea. These controls will help put you at ease while following up on the data you already have. My advice, however, is this: next time you plan an experiment, avoid this headache altogether and use external RNA spike-in standards for the normalization of your RNA-seq data!