In biology, a molecular barcode is a characteristic DNA sequence used to tell items apart and to group similar ones together. This simple but powerful concept is useful in many applications. For example, the Barcode of Life project aims to identify specimens by sequencing standard gene regions and using them as barcodes. The synthetic molecular barcodes commonly used to improve next generation sequencing (NGS) methods go by the same name. Both types of barcode serve to discriminate, but they are slightly different molecular objects.
Molecular Barcodes In NGS
Practically speaking, barcodes in NGS are strings of nucleotides that act as unique identifiers [1]. To date, the barcoding strategy has mainly been used to tag populations of molecules. For example, it enables parallel sequencing of different samples by labeling each library with a distinguishing barcode, or index (e.g. Illumina indexing). The procedure is straightforward: during library preparation, ligate the fragments from each sample to a short, characteristic barcode so that all fragments from that sample share a common sequence. After sequencing, sort the reads according to their barcode and analyze the samples individually.
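The sorting step can be sketched in a few lines of Python. This is a minimal illustration, not a production demultiplexer: it assumes the barcode sits at the very start of each read and must match exactly, and the barcodes, sample names and reads are made up for the example.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by the sample barcode found at the start of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not in the lookup go to an "undetermined" bin.
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)

# Hypothetical 4-nt barcodes and toy reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCATGA", "GGGGAAAATT"]

demultiplexed = demultiplex(reads, barcodes, barcode_len=4)
for sample, inserts in demultiplexed.items():
    print(sample, inserts)
```

Real pipelines usually tolerate one or two mismatches in the barcode and often read the index in a separate sequencing read, so treat the exact-match prefix lookup here as a simplification.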
More recently, barcodes have been employed as unique molecular identifiers (UMIs) [2]. Here, a barcode is ligated to each individual fragment before it is amplified and sequenced. In this way, UMIs serve as controls for the unavoidable errors introduced by PCR, such as artifacts and biased amplification. Thanks to the UMIs, you can take these errors into account and estimate the absolute number of fragments in your library.
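As a rough sketch of how UMIs allow counting of original molecules, the Python snippet below collapses reads that carry the same UMI for the same fragment. The data layout (a UMI paired with a fragment identifier such as a gene name) and the toy values are assumptions for illustration, and sequencing errors inside the UMI are ignored.

```python
from collections import defaultdict

def count_molecules(tagged_reads):
    """Count distinct UMIs per fragment; PCR copies share a UMI and collapse."""
    umis_per_fragment = defaultdict(set)
    for umi, fragment_id in tagged_reads:
        umis_per_fragment[fragment_id].add(umi)
    return {frag: len(umis) for frag, umis in umis_per_fragment.items()}

# Toy data: (UMI, fragment identifier). Values are hypothetical.
tagged_reads = [
    ("AACGT", "geneX"), ("AACGT", "geneX"), ("AACGT", "geneX"),  # PCR copies
    ("GGTCA", "geneX"),
    ("TTAGC", "geneY"), ("TTAGC", "geneY"),                      # PCR copies
]

molecule_counts = count_molecules(tagged_reads)
print(molecule_counts)
```

Six reads collapse to three original molecules here. The count is only trustworthy if two different molecules rarely receive the same UMI by chance, which is the question the rest of the article addresses.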
To design UMIs, use a string of degenerate nucleotides, so that any of the four bases can occur at each position and each tag is likely to be unique. For the collection of barcodes to make each fragment distinct, no two barcodes should be identical. Longer barcodes lower the probability that two or more sequences match, which is good. However, the longer the UMI, the larger the portion of each read it occupies. Thankfully, there is a compromise. Let’s do some statistics!
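The trade-off can be made concrete with a short Python sketch. It assumes a fully degenerate UMI (four possible bases per position) and a read length of 100 nucleotides; the read length is an arbitrary value chosen for illustration.

```python
def umi_tradeoff(umi_length, read_length):
    """Return (number of distinct UMIs, fraction of the read spent on the UMI)."""
    distinct_umis = 4 ** umi_length           # four possible bases per position
    read_fraction = umi_length / read_length  # sequence not available for the insert
    return distinct_umis, read_fraction

# read_length=100 is an assumed value, not taken from any specific platform.
for n in (6, 8, 10, 12, 15):
    distinct, fraction = umi_tradeoff(n, read_length=100)
    print(f"{n:>2}-mer: {distinct:>13,} possible UMIs, {fraction:.0%} of the read")
```

Each extra nucleotide multiplies the number of possible UMIs by four but costs only one more base of read length, which is why a modest increase in length goes a long way.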
The Generalized Birthday Problem
The birthday problem concerns the probability that, in a group of people, at least two share the same birthday. This is a well-known statistical problem that can be generalized and solved quite easily with approximations (e.g. using the Taylor series expansion [3]).
Here, though, the question is: what is the chance that at least two barcodes in a given sample have the same sequence? You should note some similarities…
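To see the parallel, the classic birthday problem can be checked numerically. The Python sketch below compares the exact answer with a pair-counting approximation, assuming 365 equally likely birthdays; the function names are made up for this example.

```python
import math

def birthday_exact(people, days=365):
    """Exact probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

def birthday_poisson(people, days=365):
    """Approximation: rate = number of pairs * probability that a pair matches."""
    rate = math.comb(people, 2) / days
    return 1 - math.exp(-rate)

for group in (10, 23, 50):
    exact = birthday_exact(group)
    approx = birthday_poisson(group)
    print(f"{group} people: exact {exact:.3f}, approximation {approx:.3f}")
```

Swap days for the number of possible barcodes and people for the number of barcoded molecules, and the same calculation applies to UMIs.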
How To Pass The Exam With Flying Colors
In principle, you need to examine every possible pair of barcodes and check for at least one match. One of the simplest ways to do this calculation is to use the Poisson distribution. Note that this approximation becomes more accurate as the number of pairs grows and the probability of a match for any single pair shrinks.
The Poisson distribution is defined by a single parameter, the event rate, which is simply the product of the number of events, e, and the probability of each event, p.
For the barcode problem, e is the number of ways to pick two barcodes out of m, where m is the number of barcoded molecules in the library. This is the binomial coefficient "m choose 2", which equals m(m-1)/2.
On the other hand, p is the probability that two random barcodes are identical. Let n be the number of positions in each barcode. Because each position can hold any of four nucleotides, the probability that two random barcodes are identical is 1/4^n.
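Thus, the event rate is

$$\lambda = e \cdot p = \binom{m}{2} \cdot \frac{1}{4^n} = \frac{m(m-1)}{2 \cdot 4^n}$$

If X indicates the number of matches, the probability of observing exactly k matches is

$$P(X = k) = \frac{\lambda^k \exp(-\lambda)}{k!}$$

where exp denotes the exponential function (not the event count e).
The probability of a match then becomes 1 minus the probability that no match will occur:

$$P(X > 0) = 1 - P(X = 0) = 1 - \exp(-\lambda)$$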
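The calculation described above (rate = number of pairs times per-pair match probability, then 1 minus the Poisson probability of zero matches) can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that every barcode is drawn uniformly at random from all 4^n possible sequences.

```python
import math

def collision_probability(m, n):
    """Poisson estimate of P(at least two of m random n-nt barcodes match)."""
    pairs = math.comb(m, 2)       # number of barcode pairs, "m choose 2"
    p_identical = 1 / 4 ** n      # chance that one given pair is identical
    rate = pairs * p_identical    # Poisson event rate
    # 1 - exp(-rate), written with expm1 to stay accurate for tiny rates.
    return -math.expm1(-rate)

p_match = collision_probability(m=1000, n=10)
print(f"P(at least one match) = {p_match:.3f}")
```

Real UMI pools are rarely perfectly uniform (synthesis and ligation biases favor some sequences), so the true collision probability is likely to be somewhat higher than this estimate.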
For Example
Assume you have 10-nucleotide barcodes (n=10) and a population of 1000 barcoded molecules (m=1000). In this case, the probability that at least two degenerate strings are identical is 38%. The binomial coefficient is 1000!/[2!(1000-2)!] = 499500. This is divided by 4^10 = 1048576 to give the event rate. Thus, with e now standing for Euler's number: P = 1 – e^-(499500/1048576) = 1 – e^-(0.476360321) = 1 – 0.621039668 = 0.38.
Quite high, huh?
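How good is the Poisson shortcut for these numbers? The Python sketch below compares it with the exact birthday-style calculation for the same example. Both functions are illustrative and assume that barcodes are drawn uniformly at random.

```python
import math

def exact_collision_probability(m, n):
    """Exact P(at least one match) among m uniform random n-nt barcodes."""
    space = 4 ** n
    # log P(all distinct) = sum over i of log(1 - i / space), i = 0 .. m-1
    log_p_distinct = sum(math.log1p(-i / space) for i in range(m))
    return -math.expm1(log_p_distinct)

def poisson_collision_probability(m, n):
    """Poisson approximation with rate = C(m, 2) / 4^n."""
    rate = math.comb(m, 2) / 4 ** n
    return -math.expm1(-rate)

exact = exact_collision_probability(1000, 10)
approx = poisson_collision_probability(1000, 10)
print(f"exact: {exact:.4f}, Poisson approximation: {approx:.4f}")
```

For 1000 molecules and 10-mers the two values agree to about three decimal places, so the simpler formula is adequate for design decisions of this kind.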
As anticipated, longer molecular barcodes lead to a noticeable decrease in this probability. With each added nucleotide, P(X>0) drops, reaching 2.9% with 12-mers and 0.05% with 15-mers. Conversely, the probability of a match rises with the number of molecules. Double the m-value in the first example and you will obtain approximately 85%.
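These figures can be reproduced, and turned into a design rule, with a short Python sketch. The helper min_barcode_length and the 1% threshold are illustrative choices, not a recommendation from the article, and uniform random barcodes are assumed throughout.

```python
import math

def collision_probability(m, n):
    """Poisson estimate of P(at least one match) for m barcodes of length n."""
    rate = math.comb(m, 2) / 4 ** n
    return -math.expm1(-rate)

def min_barcode_length(m, max_probability):
    """Shortest barcode length whose collision probability is within the limit."""
    n = 1
    while collision_probability(m, n) > max_probability:
        n += 1
    return n

# Effect of barcode length for a fixed population of 1000 molecules.
for n in (10, 12, 15):
    print(f"m=1000, n={n}: {collision_probability(1000, n):.2%}")

# Effect of doubling the number of molecules with 10-mers.
print(f"m=2000, n=10: {collision_probability(2000, 10):.2%}")

# Hypothetical design target: at most a 1% chance of any collision.
shortest = min_barcode_length(1000, max_probability=0.01)
print(f"shortest length for m=1000 at a 1% limit: {shortest}")
```

Because the rate grows roughly with m squared but shrinks by a factor of four per added nucleotide, doubling the number of molecules is offset by a single extra position in the barcode.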