# Probability Theory and Molecular Barcodes

Content sponsored by Sigma-Aldrich® Advanced Genomics

In biology, a molecular barcode is a characteristic DNA sequence used to distinguish items and to group similar ones together. This simple but powerful concept is useful in various applications. For example, the Barcoding of Life project aims to identify specimens by sequencing standard gene regions and using these as *barcodes*. The synthetic molecular barcodes that are commonly used to improve NGS technologies go by the same name. Both types of barcode serve to discriminate, but they are slightly different molecular objects.

## Molecular Barcodes In NGS

Practically speaking, barcodes in NGS are strings of nucleotides that act as unique identifiers.^{1} To date, the barcoding strategy is mainly used to tag populations of molecules. As an example, this approach enables parallel sequencing of different samples through the labeling of each library with a distinguishing barcode, or index (e.g. Illumina indexing). The procedure is straightforward: during the library preparation, ligate the fragments from each sample to a short, characteristic barcode so that they all share a common sequence. Then sort the reads according to their barcode and analyze the samples individually.
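The sorting step can be sketched in a few lines of Python. This is only a minimal illustration, assuming the barcode sits at the start of each read and must match exactly. The `demultiplex` helper, the sample names, and the sequences are all hypothetical, and real pipelines also have to cope with sequencing errors in the barcode.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group reads by the sample whose barcode they start with.

    Assumes all barcodes have the same length and sit at the start of
    the read; reads with an unknown barcode go to 'undetermined'.
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCATGA", "GGGGAAAAAA"]
bins = demultiplex(reads, barcodes)
```

Each sample's reads end up in their own list, with the barcode trimmed off, ready to be analyzed individually.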

More recently, barcodes have been employed as unique molecular identifiers (UMIs).^{2} Here, such barcodes are ligated to each fragment that will be amplified and sequenced. In this way, they serve as controls for the unavoidable errors introduced by PCR, such as artifacts and biased amplification. Thanks to the UMIs, you can take these errors into account and assess the absolute number of the fragments in your library.
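To see why UMIs help with counting, here is a minimal sketch. It assumes that reads carrying the same UMI are PCR copies of one original fragment and that the UMIs themselves are read without errors. The `count_molecules` helper and the sequences are hypothetical.

```python
from collections import Counter

def count_molecules(umis):
    """Return (number of distinct UMIs, reads observed per UMI)."""
    reads_per_umi = Counter(umis)
    return len(reads_per_umi), reads_per_umi

# Hypothetical UMIs from six reads covering the same locus.
observed = ["AACGT", "AACGT", "AACGT", "GGTCA", "TTAGC", "TTAGC"]
n_molecules, reads_per_umi = count_molecules(observed)
```

Six reads collapse to three distinct UMIs, so under these assumptions the library contained three original fragments, however unevenly PCR amplified them.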

To design UMIs, use a string of degenerate nucleotides, in which each position can be any of the four bases, so that every barcode is a random sequence. For the barcodes to make each fragment distinct, no two of them should be identical. Longer barcodes lower the probability that two or more sequences match, which is good. However, the longer the UMI, the larger the portion of each read it occupies. Thankfully, there is a compromise. Let’s do some statistics!

## The Generalized Birthday Problem

The birthday problem concerns the probability that in a group of people at least two share the same birthday. This is a well known statistical problem that can be generalized and quite easily solved with approximations (e.g. using the Taylor series expansion^{3}).
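For the classic version, the exact answer is easy to compute directly. The sketch below assumes birthdays are independent and uniformly spread over 365 days; the function name is just an illustrative choice.

```python
def birthday_probability(people, days=365):
    """Exact probability that at least two of `people` share a birthday,
    assuming birthdays are uniform and independent over `days` days."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

p23 = birthday_probability(23)
```

With 23 people the probability already exceeds one half, which is why the result is often called a paradox.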

Here, the question becomes: what is the chance that at least two barcodes in a given sample have the same sequence? You may notice some similarities…

## How To Pass The Exam With Flying Colors

In principle, you need to check every possible pair of barcodes for at least one match. One of the simplest ways to do this calculation is to treat the number of matches as following a Poisson distribution. Note that this approximation becomes more precise as the numbers involved grow larger.

The Poisson distribution is defined by a single parameter, the event rate, which is simply the product of the number of events, *e*, and the probability of each event, *p*.

For the barcode problem, *e* is the number of pairs that can be formed from *m* elements, where *m* is the number of different barcoded molecules in the library. This is the binomial coefficient indexed by *m* and 2, that is, $m(m-1)/2$.

On the other hand, *p* is the probability that two random barcodes are identical. Let *n* be the number of positions in each barcode. Assuming each position is drawn independently and uniformly from the four nucleotides, this probability is $1/4^{n}$.

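Thus, the event rate, written here as $\lambda$, is

$$\lambda = e \cdot p = \binom{m}{2} \cdot \frac{1}{4^{n}} = \frac{m(m-1)}{2 \cdot 4^{n}}$$

If *X* indicates the number of matches, the Poisson distribution gives

$$P_{X=k} = \frac{\lambda^{k}}{k!} \exp(-\lambda)$$

The probability of a match then becomes 1 minus the probability that no match will occur:

$$P_{X>0} = 1 - P_{X=0} = 1 - \exp(-\lambda)$$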
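The whole recipe fits in a short Python function. This is a sketch under the assumptions stated above (uniform, independent nucleotides and the Poisson approximation); the function name and the values in the usage line are illustrative only.

```python
import math

def match_probability(m, n):
    """Poisson approximation of the chance that at least two of
    m random barcodes of length n are identical."""
    pairs = math.comb(m, 2)       # number of events: all barcode pairs
    p_identical = 1 / 4 ** n      # chance that one given pair matches
    rate = pairs * p_identical    # Poisson event rate
    # -expm1(-rate) equals 1 - exp(-rate) but stays accurate for tiny rates.
    return -math.expm1(-rate)

# Hypothetical library: 500 molecules tagged with 8-nucleotide barcodes.
p = match_probability(m=500, n=8)
```

`math.comb` requires Python 3.8 or later; on older versions, `m * (m - 1) // 2` gives the same number of pairs.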

## For Example

Assume you have 10-nucleotide barcodes ($n = 10$) and a population of 1000 molecular barcodes ($m = 1000$). In this case, the probability that at least two of the degenerate strings are identical is about 38%. The binomial coefficient is $1000!/[2!\,(1000-2)!] = 499500$. This is divided by $4^{10} = 1048576$. Thus:

$$P_{X>0} = 1 - \exp\left(-\frac{499500}{1048576}\right) = 1 - \exp(-0.476360321) = 1 - 0.621039668 \approx 0.38$$

Quite high, huh?
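If you want to check how much the Poisson shortcut costs you, the exact birthday-style product is cheap to compute for this example. The sketch below assumes uniform, independent barcodes; both function names are illustrative.

```python
import math

def exact_match_probability(m, n):
    """Exact chance that at least two of m uniform random n-mers match."""
    space = 4 ** n                 # number of possible barcodes
    p_all_distinct = 1.0
    for i in range(m):
        # The (i+1)-th barcode must avoid the i sequences already drawn.
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct

def poisson_match_probability(m, n):
    """Poisson approximation: 1 - exp(-C(m, 2) / 4^n)."""
    return -math.expm1(-math.comb(m, 2) / 4 ** n)

exact = exact_match_probability(1000, 10)
approx = poisson_match_probability(1000, 10)
print(f"exact: {exact:.4f}, Poisson: {approx:.4f}")
```

For these values the two results differ by less than 0.001, so the approximation is more than good enough for choosing a barcode length.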

As anticipated, longer molecular barcodes noticeably decrease this probability. With each added nucleotide, $P_{X>0}$ drops, reaching 2.9% with 12-mers and 0.05% with 15-mers. Conversely, the probability of a match rises with the number of molecules. Double the value of $m$ in the first example and you obtain approximately 85%.
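The same calculation can be turned around to pick a barcode length. The sketch below uses the Poisson approximation to scan the cases mentioned above and then searches for the shortest barcode that keeps the match probability under a chosen threshold. The `shortest_barcode` helper and the 1% threshold are assumptions made for illustration.

```python
import math

def match_probability(m, n):
    """Poisson approximation of P(at least one match)."""
    return -math.expm1(-math.comb(m, 2) / 4 ** n)

def shortest_barcode(m, max_probability):
    """Smallest n whose match probability is at most max_probability.

    max_probability should be a positive number below 1.
    """
    n = 1
    while match_probability(m, n) > max_probability:
        n += 1
    return n

for n in (10, 12, 15):
    print(f"m=1000, n={n}: {match_probability(1000, n):.2%}")
print(f"m=2000, n=10: {match_probability(2000, 10):.2%}")

# Hypothetical target: at most a 1% chance of any match among 1000 molecules.
n_needed = shortest_barcode(1000, 0.01)
```

Because each added nucleotide divides the event rate by four, the required length grows only slowly with the number of molecules.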

### References

- Shoemaker DD *et al.* (1996) Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. *Nat Genet*. 14: 450–456.
- Kivioja T *et al.* (2011) Counting absolute numbers of molecules using unique molecular identifiers. *Nat Methods*. 9: 72–74.
- Casbon JA *et al.* (2011) A method for counting PCR template molecules with application to next-generation sequencing. *Nucleic Acids Res*. 39: e81.