Whether your experiment relies on reference-based genome assembly or on mapping reads to a reference genome to identify variants, you need to choose a human reference genome assembly.
But wait! You go to the FTP site of NCBI’s RefSeq and click on the Homo sapiens folder. There you are presented with two choices: GRCh37 or GRCh38. Which one should you choose? What differs between them? Are they essentially the same? Does it make a difference which you choose?
To address these questions, let’s talk about what reference genomes are: why we need them, how they differ from one another, and finally how they affect our results.
What Are Reference Genomes?
Reference genomes are sequences of A, T, C, and G nucleotides that represent the complete genome of an organism, not only its genes. These sequences are stored in public databases.
Unsurprisingly, individuals of the same species differ from one another at the sequence level. A reference genome is a representative example of the species’ genome sequence. Note, however, that a reference genome is not an ideal genome, nor the genome of any single individual. Instead, it is closer to a composite assembled from the genomes of several donors.
A common source for reference genomes is NCBI’s RefSeq database. This collection contains reference sequences from 72,965 different organisms (as of the Sept 15, 2017 release) and spans prokaryotes, eukaryotes, and viruses.
Why Do We Need a Reference Genome?
We need a reference genome as a common baseline: comparing samples against it lets us pinpoint their differences and answer biological questions.
As kids, many of us loved the “find the differences” challenge in our local newspaper, where you spot the differences between two pictures. Generally, you use one picture as a reference and find the missing or changed objects in the second picture.
You can apply that same technique here, because it is exactly what you do when you compare genomes. You find the differences (variants or mutations) in your sample’s genome by comparing it to the reference genome.
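The "find the differences" idea can be sketched in a few lines of Python. This is only a toy illustration: it assumes the two sequences are already aligned and equal in length, which real variant-calling pipelines cannot take for granted. The function name `find_differences` and both sequences are invented for this example.

```python
from __future__ import annotations


def find_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumption: both sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(
            zip(reference, sample), start=1
        )
        if ref_base != sample_base
    ]


# Toy sequences invented for illustration; not real genomic data.
reference = "ATCGGCTA"
sample = "ATCGACTT"

differences = find_differences(reference, sample)
for position, ref_base, sample_base in differences:
    print(f"position {position}: reference {ref_base}, sample {sample_base}")
```

Each reported tuple plays the role of one "spotted difference" between the two pictures.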
Confusion over Terminology
The same reference genome assembly can be available from several sources under different names. In our example, GRCh37 and hg19 refer to the same assembly but are named differently by the institutions that distribute them, the Genome Reference Consortium (GRC) and the University of California at Santa Cruz (UCSC), respectively. The same is true of GRCh38 and hg38.
Earlier human reference genome versions include:
- NCBI36 or hg18 (2006)
- NCBI35 or hg17 (2004)
- NCBI34 or hg16 (2003)
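When a pipeline mixes files from different sources, it can help to normalize these names before comparing them. The sketch below encodes only the name pairs listed in this article. The helper names `canonical_name` and `same_assembly` are hypothetical, and the sketch deliberately ignores any finer differences between the distributed files.

```python
# UCSC names mapped to the equivalent GRC/NCBI names listed in this article.
UCSC_TO_GRC = {
    "hg38": "GRCh38",
    "hg19": "GRCh37",
    "hg18": "NCBI36",
    "hg17": "NCBI35",
    "hg16": "NCBI34",
}

# Case-insensitive lookup from either naming style to the GRC/NCBI name.
_CANONICAL = {}
for ucsc_name, grc_name in UCSC_TO_GRC.items():
    _CANONICAL[ucsc_name.lower()] = grc_name
    _CANONICAL[grc_name.lower()] = grc_name


def canonical_name(name: str) -> str:
    """Return the GRC/NCBI-style name for a known assembly name."""
    try:
        return _CANONICAL[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown assembly name: {name!r}") from None


def same_assembly(name_a: str, name_b: str) -> bool:
    """Return True if both names refer to the same assembly version."""
    return canonical_name(name_a) == canonical_name(name_b)


print(canonical_name("hg19"))
print(same_assembly("hg38", "GRCh38"))
print(same_assembly("hg19", "GRCh38"))
```

A check like this at the start of an analysis is a cheap way to catch data sets that were mapped to different builds.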
GRCh37 vs. GRCh38: What’s the Difference?
Both GRCh37 and GRCh38 are human genome assemblies produced by the Genome Reference Consortium (GRC). GRCh37 was released in 2009, and GRCh38 (also called “build 38”) followed four years later, so GRCh38 can be viewed as an updated and corrected version of the earlier assembly.
Primarily, there are three updates in the GRCh38 version:
- Correction of erroneous bases in the assembled sequence
- Inclusion of model centromere sequences
- Addition of alternate loci
Apart from these, some misassembled regions in GRCh37 have been retiled in GRCh38. GRCh38 is the first human reference genome to include centromere sequences: modeled centromeres replace the 3-million-base placeholder gaps used in the earlier build (i.e., GRCh37). The inclusion of centromere sequences opens up areas of study that were not accessible before.
GRCh38 also includes sequences that were only partially captured in earlier versions. However, gaps remain in the genome, and new technologies and methods are helping to close them, with the aim of complete coverage of the human genome.
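To see what a gap looks like in practice, the sketch below relies on a common convention that this article does not spell out: unresolved stretches in an assembly's sequence file are written as runs of the letter N. The function name `find_gaps` and the toy sequence are invented for illustration.

```python
from __future__ import annotations

import re


def find_gaps(sequence: str, min_length: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) for each run of N, as 0-based half-open intervals.

    Assumption: gaps are encoded as runs of 'N' (or 'n') in the sequence.
    """
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", sequence)
        if match.end() - match.start() >= min_length
    ]


# Toy sequence invented for illustration; not real genomic data.
toy_sequence = "ATCGNNNNNGGCTANNAT"

gaps = find_gaps(toy_sequence)
total_gap_bases = sum(end - start for start, end in gaps)
print(gaps, total_gap_bases)
```

Running the same kind of count on the same chromosome from two builds would be one rough way to see how much gap sequence a newer build has filled in.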
Do I Need to Re-Analyze My Data Using GRCh38 Now?
If you have been using GRCh37, you do not need to go back and re-analyze your data. Thankfully, NCBI has taken care of this.
NCBI’s Genome Remapping Service converts annotation data from GRCh37 coordinates to GRCh38 coordinates. For more details about this tool, see the NCBI Insights post listed at the end of this article.
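Conceptually, moving an annotation between builds means finding the stretch of the old assembly that contains it and shifting the coordinate to where that stretch sits in the new assembly. The sketch below illustrates only that idea. It is not the Genome Remapping Service's API or algorithm, and the blocks, offsets, and names (`AlignedBlock`, `TOY_BLOCKS`, `remap_position`) are hypothetical, invented for this example.

```python
from __future__ import annotations

from typing import NamedTuple, Optional, Sequence


class AlignedBlock(NamedTuple):
    """A stretch of the old assembly and where it starts in the new one."""

    old_start: int  # 0-based, inclusive, in the old assembly
    old_end: int    # exclusive, in the old assembly
    new_start: int  # 0-based start of the matching stretch in the new assembly


# Hypothetical blocks, invented for illustration; not real GRCh37/GRCh38 data.
TOY_BLOCKS = [
    AlignedBlock(old_start=0, old_end=1000, new_start=0),
    # Pretend 200 bases were added in the new assembly before this block.
    AlignedBlock(old_start=1000, old_end=5000, new_start=1200),
]


def remap_position(
    position: int, blocks: Sequence[AlignedBlock] = TOY_BLOCKS
) -> Optional[int]:
    """Return the new-assembly coordinate, or None if nothing aligns there."""
    for block in blocks:
        if block.old_start <= position < block.old_end:
            return block.new_start + (position - block.old_start)
    return None


print(remap_position(500))
print(remap_position(1500))
print(remap_position(6000))
```

Positions that fall outside every block have no counterpart in the new build, so some annotations may not carry over and are worth reviewing after any conversion.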
Final Thoughts on Reference Genomes
GRCh38 improves on GRCh37 as a genome assembly, and this build yields more reliable genomic analysis results. Annotations for the latest assemblies are available in the major genome browsers (NCBI, UCSC, and Ensembl). Undoubtedly, there will be more updates in the future, so keep up with new sequencing technologies, genome analysis pipelines, and the annotation tracks that come with them.
- NCBI. (2014) Genome Remapping Service assists in the transition to the new human genome reference assembly (GRCh38). [online] NCBI Insights.
- Pruitt KD, Tatusova T, Maglott DR. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research. 35(Database issue): D61–5.
- Bio-IT World.