Perhaps one of the most significant discoveries in modern genetics (after the genetic code was worked out, anyway) is the role of genetic variation in evolution, disease and the creation of plants and animals. While the Human Genome Project (and many other genome projects, for that matter) showed how many genes living things share, these projects also demonstrated the importance of variants.
Variants are key to successful evolution: changes in genotype (usually small ones) can lead to changes in phenotype. So, what kinds of variants are important to disease, development and even survival?
Naturally, geneticists and other scientists are very interested in these variants. One problem with studying them is understanding the functional consequences of rare variants. Over the past 50 centuries, nearly one million variants have appeared in humans. The majority of these are unlikely to affect our health, but a number have been connected to disease. The search for those has been compared to finding a needle in a stack of needles.
Today’s Key Variants
“Structural variants” is really an umbrella term, referring to SNP/SNVs, indels, copy number variations and a number of other variants that change the sequence of base pairs in a genome. These variations, while small compared to a frameshift mutation, are increasingly important in understanding human diseases. In fact, nearly all human tumors have been found to carry some structural variants (some just a handful, others thousands).
Single-nucleotide Polymorphisms/Single-nucleotide Variations (SNP/SNVs)
Known as single-nucleotide polymorphisms (SNPs) in populations and single-nucleotide variations (SNVs) in individuals, these variants are simply exchanges of one nucleotide base pair for another. There are several million SNPs in the average human, and perhaps as many in plants. These have become very important markers for certain diseases, and will no doubt serve as guideposts for the development of personalized treatments. A recent study, in fact, showed that while an individual SNP or two did not appear to correlate with cancers, a group of 77 SNPs did seem to be strongly associated with the development of breast cancer.
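As a minimal sketch of what a single-nucleotide difference looks like in data, the Python snippet below compares two aligned sequences of equal length and reports every position where the bases differ. The sequences and the function name `find_snvs` are made up for illustration; real pipelines work from millions of aligned reads rather than a pair of short strings.

```python
def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length,
    so every difference is a substitution rather than an indel.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Hypothetical 12-base sequences that differ at a single position.
reference = "ATGGCCTTAGCA"
sample = "ATGGCCTCAGCA"
print(find_snvs(reference, sample))  # [(8, 'T', 'C')]
```

Whether such a difference counts as a SNP or an SNV depends, as described above, on whether it is seen across a population or in one individual; the comparison itself is the same.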
Indels
Short for “insertion” and “deletion,” indels are base pairs added to or removed from a segment of DNA. It’s estimated that humans have several million of these. More substantial than SNP/SNVs, indels involve between 1 and 10,000 base pairs. Like SNP/SNVs, they most likely play some role in disease and may prove important in personalized medicine. In cystic fibrosis, for example, a three-base-pair deletion removes a single amino acid from the CFTR protein, which triggers the disease.
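The cystic fibrosis case works the way it does because codons are three bases long: losing exactly three bases drops one amino acid but leaves the rest of the reading frame intact, whereas an indel whose length is not a multiple of three shifts every codon after it. The sketch below classifies an indel on that basis. The function name `classify_indel` and the example alleles are hypothetical, and the rule only makes sense for indels inside coding sequence.

```python
def classify_indel(ref_allele: str, alt_allele: str) -> str:
    """Classify a coding-region indel by its effect on the reading frame.

    Alleles are given as the reference and the altered sequence at the
    same site; only their difference in length matters here.
    """
    length_change = len(alt_allele) - len(ref_allele)
    if length_change == 0:
        return "not an indel"
    kind = "insertion" if length_change > 0 else "deletion"
    # A change that is a multiple of 3 adds or removes whole codons.
    frame = "in-frame" if abs(length_change) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


# Hypothetical alleles, chosen only to show the three outcomes.
print(classify_indel("ACTT", "A"))  # in-frame deletion (3 bases lost)
print(classify_indel("AC", "A"))    # frameshift deletion (1 base lost)
print(classify_indel("A", "AGG"))   # frameshift insertion (2 bases gained)
```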
Copy Number Variations
This refers to differences in the number of copies of a specific gene found in a genome. Classical genetics taught us that there were two copies of each gene in every genome. However, recent advances have shown that there may be many copies of a gene, or none, and these variations can lead to disease states. Copy number variations may be the most prevalent of all: because of their large size, they may involve three times as many base pairs as SNP/SNVs, the next-most prevalent structural variation.
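One common way to get at copy number from sequencing data is read depth: a region present in more copies tends to collect proportionally more reads. The sketch below is an illustration of that idea only, not a method described in this article. It assumes a diploid baseline, and the function name `estimate_copy_number`, the gene labels and all depth values are invented.

```python
def estimate_copy_number(region_depth: float, baseline_depth: float, ploidy: int = 2) -> int:
    """Estimate copy number from read depth relative to a two-copy baseline.

    baseline_depth is the average depth over regions assumed to be
    present at the normal copy number (ploidy).
    """
    if baseline_depth <= 0:
        raise ValueError("baseline depth must be positive")
    return round(ploidy * region_depth / baseline_depth)


# Hypothetical average read depths; 30x is taken as the two-copy baseline.
baseline = 30.0
depths = {"gene_A": 31.0, "gene_B": 14.0, "gene_C": 62.0, "gene_D": 0.0}
for gene, depth in depths.items():
    print(gene, estimate_copy_number(depth, baseline))
```

With these numbers, gene_A comes out at the expected two copies, gene_B at one, gene_C at four and gene_D at none. Real data are far noisier, so production tools normalize and segment the depth signal rather than rounding a single ratio.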
Translocations and Inversions
These are chromosomal rearrangements of genes (or at least segments of DNA), in which a DNA segment is broken off and either relocated to some other point on the same or another chromosome (translocation), or reinserted into the chromosomal DNA in “reverse,” 180 degrees from its previous alignment (inversion). Generally, the larger the segment of DNA that is subject to these rearrangements, the more likely it is to cause a change in phenotype.
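Because DNA is double-stranded, flipping a segment by 180 degrees means that the strand being read now shows the reverse complement of the original segment. The short sketch below makes that concrete; the function name `invert_segment`, the sequence and the coordinates are hypothetical.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(sequence: str, start: int, end: int) -> str:
    """Return the sequence with the 0-based, half-open interval [start, end) inverted.

    The inverted segment is replaced by its reverse complement, which is
    what the same strand reads after the segment is flipped end to end.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("interval must lie within the sequence")
    segment = sequence[start:end]
    inverted = segment.translate(COMPLEMENT)[::-1]
    return sequence[:start] + inverted + sequence[end:]


original = "AAACCGTTTT"
rearranged = invert_segment(original, 3, 6)
print(rearranged)                         # AAACGGTTTT
print(invert_segment(rearranged, 3, 6))   # AAACCGTTTT (inverting twice restores it)
```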
Importance of Sample Size
When studying variants, sample size is crucial for determining whether a variation occurs in just one genome, is down to experimental error, or is a truly significant finding. The sample sizes needed for genome-wide studies, for example, may have to be so huge as to prohibit routine analysis. For a comprehensive review on finding appropriate sample sizes, see Witte (2012), by John Witte, an epidemiologist at UC San Francisco; for another study, see Hong and Park (2012).
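To give a feel for why the numbers grow so quickly, here is a rough sketch using the standard normal-approximation formula for comparing two proportions (for instance, the fraction of cases and of controls carrying a variant). It is a textbook approximation rather than the approach of either review cited above; the function name `samples_per_group` and the carrier frequencies are hypothetical, and 5e-8 is used as the conventional genome-wide significance threshold.

```python
import math
from statistics import NormalDist


def samples_per_group(p_cases: float, p_controls: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per group for a two-sided two-proportion test."""
    effect = p_cases - p_controls
    if effect == 0:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_cases * (1 - p_cases) + p_controls * (1 - p_controls)
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical variant carried by 2% of cases and 1% of controls.
single_test = samples_per_group(0.02, 0.01, alpha=0.05)
genome_wide = samples_per_group(0.02, 0.01, alpha=5e-8)
print("one variant tested:", single_test)
print("genome-wide threshold:", genome_wide)
```

The same effect needs several times as many samples once the threshold is tightened to account for testing variants across the whole genome, and rarer variants or smaller differences push the figure higher still.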
Genome-wide association studies (GWAS) initially showed the extent of these structural variants in the human and other genomes. However, GWAS and whole-genome sequencing (WGS) are still not economical or practical ways to study individual variations and the groups that may be associated with variation in a certain gene or other region of DNA. Below are a variety of methods that can be used to study variants.
Hot Spot Analysis
For well-studied diseases such as certain cancers, some NGS manufacturers sell panels that contain sequences of known cancer genes. For example, Ion Torrent (now part of Thermo Fisher Scientific) has a “hot spot” panel that contains 207 primer pairs that match 50 oncogenes and tumor suppressor genes, including KRAS, BRAF, and EGFR. Sequencing is then carried out using any of the brands of next-generation machines (Ion PGM, Roche 454, Illumina MiSeq, etc.). This is a fast way to determine if your tumor cell samples show any variations in these genes. However, these panels cannot tell you if you’ve discovered a whole new cancer gene.
Whole Exome Sequencing
When whole-genome sequencing proves too time-consuming or expensive, sequencing the exome (the coding regions) is a viable alternative. This technique assumes, of course, that your variants lie within genes (or other coding DNA). Whole exome sequencing has been invaluable in discovering the variants behind diseases, phases of development and normal phenotypic variation. However, for epigenetic studies and searches for de novo variants that might lie outside of coding regions, this technique isn’t helpful.
Chromatin immunoprecipitation (ChIP) has been the latest screening method of choice. It uses an antibody directed against a certain histone modification to pull down cross-linked protein-DNA complexes from a chromatin preparation; the recovered DNA can then be analyzed by qPCR or sequencing. This technique has been useful for detecting variants that may not have been caused directly by expressed DNA (i.e., genes), showing the potential roles played by non-coding DNA and epigenetics in the development of variants.
Variant Calling Calls for Good Software
Whatever sample is used, and however it is sequenced or isolated, variants still need to be accurately called, or identified. Because the amount of data generated by whole-genome sequencing (and even by exome sequencing) is enormous, bioinformatics techniques have been developed to determine true variants and (one hopes) weed out false positives and negatives. For a detailed review of the software available for variant identification and analysis, see Pabinger et al. (2013).
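To show the basic idea behind calling, and why false positives are a concern, here is a deliberately naive sketch. It looks at the bases that different reads report at one position and calls a variant only if enough reads cover the site and a large enough fraction of them disagree with the reference. The function name `call_variant` and both thresholds are assumptions made for illustration; real callers also weigh base and mapping quality and use statistical models rather than fixed cutoffs.

```python
from collections import Counter
from typing import Optional


def call_variant(pileup: str, ref_base: str, min_depth: int = 10,
                 min_alt_fraction: float = 0.2) -> Optional[tuple[str, float]]:
    """Return (alternate base, fraction of reads supporting it) or None.

    pileup holds one base per read covering the position, e.g. "AAAAGAAG".
    """
    depth = len(pileup)
    if depth < min_depth:
        return None  # too few reads to trust any call
    counts = Counter(pileup.upper())
    alt_counts = {base: n for base, n in counts.items() if base != ref_base.upper()}
    if not alt_counts:
        return None  # every read matches the reference
    alt_base, alt_count = max(alt_counts.items(), key=lambda item: item[1])
    if alt_count / depth < min_alt_fraction:
        return None  # likely sequencing error rather than a real variant
    return alt_base, alt_count / depth


# Hypothetical pileups at three positions where the reference base is A.
print(call_variant("AAAAAGGGGGAA", "A"))  # 5 of 12 reads show G: called
print(call_variant("AAAAAAAAAAAT", "A"))  # 1 of 12 reads shows T: not called
print(call_variant("AG", "A"))            # only 2 reads: not called
```

Loosening either threshold catches more true variants but lets more sequencing errors through, which is the trade-off between false negatives and false positives mentioned above.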
References
Hong, E.P., and Park, J.W. (2012). Sample size and statistical power calculation in genetic association studies. Genomics and Informatics. 10(2): 117–22.
Kumar, S., et al. (2012). SNP discovery through next-generation sequencing and its applications. International Journal of Plant Genomics. 831460.
Mayo Clinic (2015). Mayo Clinic researchers combine common genetic variants and other factors to improve breast cancer risk prediction. Press release.
Mullaney, J.M., et al. (2010). Small insertions and deletions (INDELs) in human genomes. Human Molecular Genetics. 19: R131–6.
Pabinger, S., et al. (2013). A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in Bioinformatics. 15:256–27.
Witte, J. (2012). Rare genetic variants and treatment response: sample size and analysis issues. Statistics in Medicine. 31(25): 3041–50.