We are now living in the era of the $1000 genome. Unfortunately, most of us are still paying significantly more than this for a genome, or an equivalent amount of data in the form of exomes or RNA-seq reads. There are several reasons for this higher-than-expected price and this post aims to highlight where the costs lie. I’ve based the costs in this post on how we run Illumina NGS in my lab, but added a bit on top to reflect something closer to Full Economic Cost (FEC).
Andrea Sboner in Mark Gerstein’s lab at Yale published a great paper in 2011: The real cost of sequencing: higher than you think! [1] In it they highlighted the costs of (i) sample handling, (ii) sequencing (including library preparation), (iii) data processing, and (iv) downstream biological analyses. The $1000 genome available from Illumina on the X Ten platform is inclusive of all these; however, it is only available if you are going to sequence tens of thousands of Human genomes at 30x coverage. Most NGS users will be generating a few genomes, tens of exomes and perhaps hundreds of RNA-seq samples, so a $1000 X Ten genome is a great headline but not such a useful number to quote to the rest of us. To estimate the real costs of sequencing in 2014 I’ve used the same four-step process you’ll find in the Sboner paper, except that I’ve split their single sequencing step in two so that library prep and Illumina sequencing are discussed separately.
All experiments need to be planned, and this cost is overlooked most of the time. Quantifying it is very difficult, but getting a post-doc, their supervisor, a biostatistician, a bioinformatician and someone from the genomics core lab in a room for 30 minutes probably costs a few hundred dollars. Samples need to be collected, and if this is a mouse experiment or one with clinical samples the collection and processing can add significantly to the cost. Lastly, nucleic acids need to be extracted, quality controlled and quantified before any sequencing can start.
Cost: I’ve left this one blank for you to fill in!
Before you can get your samples onto the sequencer you’ve got to make libraries. Most people will be using kits from companies like Illumina, NEB, Agilent and Rubicon; in fact there are literally dozens of providers out there. A standard Illumina adapter ligation kit costs $50-100 per sample to buy, and probably takes a couple of days’ work by a technician or post-doc in the lab. It’s about the same for RNA-seq or ChIP-seq prep as well.
Cost for WGS library prep: $125 per sample.
Unlike library prep, sequencing is very heavily affected by the experimental requirements. Genomes and exomes are using lots of paired-end 125bp sequencing today, whereas mRNA differential gene expression and ChIP-seq use relatively few single-end 50bp reads. You can estimate how many reads you need, and therefore how much you’ll need to spend, by rearranging the Lander-Waterman equation [2]. The general equation is C = LN/G*, which can be rewritten as N = CG/L. Multiplying N by a cost per million reads gives us a price for the experiment.
The sequencing cost is very much determined by your application, so there are some examples below. These use paired-end reads of 125bp costing $2600 per lane (300M reads) or single-end reads of 50bp costing $1200 per lane.
Human genome 30x coverage = (30-fold) × (3×10⁹ bp) / (250 bp per PE125 read pair), which requires 360M reads costing $3120.
Human exome 50x coverage = (50-fold) × (1.5×10⁸ bp) / (250 bp per PE125 read pair), which requires 30M reads costing $260.
RNA-seq for splicing analysis: 50M SE50bp reads costing $200.
ChIP-seq for transcription factor binding: 40M SE50bp reads costing $160.
RNA-seq for differential mRNA expression: 20M SE50bp reads costing $80.
Human amplicome (30x 250bp amplicons) 500x coverage = (1000-fold) × (7.5×10⁴ bp) / (250 bp per PE125 read pair), which requires 0.3M reads costing $3.
Cost for WGS 30x coverage: $3120 per sample.
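The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration of the rearranged Lander-Waterman equation (N = CG/L) with the lane prices quoted above. The function names are my own, the cost is assumed to scale pro rata with the fraction of a lane used, and the 300M reads per single-end lane is inferred from the SE50 prices rather than stated outright.

```python
# Sketch of the read-count and cost estimates, using the lane prices quoted in the post.
# Assumption: you pay pro rata for the fraction of a lane you use.

PE125_LANE_PRICE = 2600   # dollars per paired-end 125bp lane
SE50_LANE_PRICE = 1200    # dollars per single-end 50bp lane
READS_PER_LANE = 300e6    # stated for PE125; assumed to be the same for SE50
PE125_BASES = 250         # bases delivered per paired-end 125bp read pair


def reads_required(coverage, genome_bp, bases_per_read):
    """Rearranged Lander-Waterman equation: N = C * G / L."""
    return coverage * genome_bp / bases_per_read


def sequencing_cost(n_reads, lane_price, reads_per_lane=READS_PER_LANE):
    """Pro-rata share of a lane price for n_reads reads."""
    return n_reads / reads_per_lane * lane_price


genome_reads = reads_required(30, 3e9, PE125_BASES)
exome_reads = reads_required(50, 1.5e8, PE125_BASES)

genome_cost = sequencing_cost(genome_reads, PE125_LANE_PRICE)
exome_cost = sequencing_cost(exome_reads, PE125_LANE_PRICE)
splicing_cost = sequencing_cost(50e6, SE50_LANE_PRICE)

print(f"30x genome: {genome_reads / 1e6:.0f}M reads, ${genome_cost:.0f}")
print(f"50x exome:  {exome_reads / 1e6:.0f}M reads, ${exome_cost:.0f}")
print(f"Splicing RNA-seq: 50M reads, ${splicing_cost:.0f}")
```

Swapping in your own coverage, target size and lane price gives a first estimate for other applications, although a core lab may well charge by the whole lane rather than pro rata.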
Data Processing: Primary Processing of FastQ
Whilst biologists writing grants are often accused of leaving out the computational costs, for the primary processing alone these do seem somewhat insignificant compared to the wet-lab side of things. The Sboner paper estimated these costs to be around $160 per sample. I’ve kept this value the same, on the assumption that whilst sequencing data volumes have gone up, computational costs have dropped.
Cost for processing a 30x WGS: $160 per sample.
Downstream Biological Analyses
This is perhaps the most contentious cost to estimate. Some people have talked about the $1000 genome requiring a $100,000 interpretation! I think this is a stretch too far, but the reality is that analysis can be very open-ended, and as it still takes a bunch of clever people to consider how best to process the information and visualise the results, it can be expensive in terms of time. Time is money, so don’t leave this out of your next grant proposal, or if you do, then expect a lot of harrumphing from your bioinformatics colleagues when you turn up with a disk or FastQ file!
I’m in the minority that believes much of the computational burden is going to disappear from routine analysis, e.g. A vs B RNA-seq, or a ‘basic’ interpretation of an exome or genome. But if you’re doing anything complex be ready to spend time and money.
Cost for interpreting a 30x WGS: ?
The figures above get me to a 2014 cost of about $3500 for a 30x Human genome, and nearly all of that is still sequencing. However, that ignores the actual costs of sample handling and downstream biological analysis. If you want lots of Human genomes then you’re probably looking for an X Ten provider with spare capacity. But if you want an RNA-seq project of 24 samples in triplicate you’re faced with a bill for about $15,000 ((24x3x$125)+(24x3x$80)). So when your core responds to your “how much…” question with a “how long is a piece of string”, don’t be too frustrated; sit down and talk it through.
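As a rough check on those totals, here is a small Python sketch that adds up the per-sample figures quoted in this post. The variable names are illustrative only. Sample handling and downstream analysis are deliberately left out because they have no figure attached above, and the RNA-seq total follows the formula in the text, which counts library prep and sequencing only.

```python
# Per-sample costs quoted in the post (US dollars, 2014 estimates).
LIBRARY_PREP = 125        # library prep, per sample
WGS_SEQUENCING = 3120     # 30x Human genome, PE125
PRIMARY_PROCESSING = 160  # primary processing of FastQ (Sboner et al. estimate)
RNASEQ_DGE_SEQUENCING = 80  # 20M SE50bp reads for differential expression

# 30x whole genome: library prep + sequencing + primary processing.
wgs_per_sample = LIBRARY_PREP + WGS_SEQUENCING + PRIMARY_PROCESSING

# RNA-seq project: 24 samples in triplicate, library prep + sequencing only.
n_libraries = 24 * 3
rnaseq_project = n_libraries * (LIBRARY_PREP + RNASEQ_DGE_SEQUENCING)

print(f"30x WGS per sample: ${wgs_per_sample}")
print(f"RNA-seq project ({n_libraries} libraries): ${rnaseq_project}")
```

The sums come to $3405 and $14,760, which is where the rounded "about $3500" and "about $15,000" come from. Adding your own estimates for sample handling and interpretation to the same sum is the quickest way to see how far the headline number moves.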
Lastly, although the $1000 genome is not available to most of us, we should not lose sight of the fact that in 10 years we’ve come from a $300M genome to one that’s realistically available at around $3000. That’s a 100,000-fold drop!
*C is the genome coverage, G is the haploid genome length, L is the sequence read length, and N is the number of sequence reads.
1. Sboner, A., Mu, X. J., Greenbaum, D., Auerbach, R. K. & Gerstein, M. (2011) The real cost of sequencing: higher than you think! Genome Biol. 12:125.
2. Lander, E. S. & Waterman, M. S. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3):231–239.