Using Synthetic DNA For Long Term Data Storage

The amount of data requiring long-term storage is growing at an accelerating rate, and current long-term digital storage technology cannot keep up. Roughly 2.5 quintillion bytes of data are created every day¹⁻² as more computers and network infrastructure come online. For average users, long-term storage is probably not an issue. However, for organizations and corporations required to store huge amounts of digital transaction data, there is an urgent need to find new data storage solutions.

Current storage media such as hard disks, optical discs, and tapes work well, but they have some obvious drawbacks. For example, the average life of a hard disk is three to five years, and optical CDs last ten years at best. Long-term storage media such as tapes or optical Blu-ray discs store roughly 10 terabytes (TB) of data and, when stored under ideal temperature and humidity conditions, can last for several decades. However, both formats are bulky and require physical space for storage.³

What Can We Do if Technologies are Becoming Obsolete?

Naturally, we turn to mother nature for a brilliant solution. The answer is in our DNA! Yes, you got that right. Think about how nature evolved a long-term storage solution for passing down genetic information. Storing data using DNA might sound like science fiction, but conceptually it’s not that difficult. You see, all digital information is represented as zeros and ones. In our genetic code, there are four nucleotide bases. So how about assigning each base a value of either zero or one? That’s exactly what George Church’s team did with synthetic DNA.⁵ It’s simple yet elegant.

In theory, storing digital information in synthetic DNA has certain advantages. First, in terms of weight-to-capacity ratio, DNA provides tremendous storage capacity in just a tiny speck. As a rough estimate, about 1 kg of DNA could store all of the world’s data today.⁴ Second, DNA can be quite stable and, therefore, long-lasting: so stable that scientists are talking about recovering the entire genome of the now-extinct woolly mammoth.⁶

Here is how two teams used different approaches to code binary data into the four-letter DNA alphabet.

The Steps in Using Synthetic DNA for Storage

Convert your binary code into nucleotide code.
Split the code into short fragments and add an address code to each fragment (plus flanking sequences for amplification).
Synthesize your DNA as short fragments of oligonucleotides.
Store in the freezer.
When you need the archive, PCR amplify it and sequence it.
Analyze the sequence data and re-assemble it.
Finally, turn it back to binary code!
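
As a rough illustration of the steps above, here is a minimal Python sketch of the purely computational parts: converting bytes to bases, splitting them into addressed fragments, and putting them back together. Everything here is an assumption made for illustration and is not taken from either team's method: the 2-bits-per-base mapping (00→A, 01→C, 10→G, 11→T), the 16-base payload, the 4-base address, and all function names. Synthesis, freezing, PCR, sequencing, and the flanking amplification sequences are left out.

```python
import random

BASES = "ACGT"  # hypothetical mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_bases(data: bytes) -> str:
    """Convert binary data to nucleotide code, 2 bits per base."""
    return "".join(BASES[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))


def bases_to_bytes(seq: str) -> bytes:
    """Turn nucleotide code back into bytes (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def make_fragments(seq: str, payload_len: int = 16, addr_len: int = 4) -> list[str]:
    """Split the sequence into fragments, each prefixed with its address."""
    fragments = []
    for n, start in enumerate(range(0, len(seq), payload_len)):
        # Write the fragment number n as addr_len base-4 digits.
        addr = "".join(BASES[(n >> (2 * k)) & 0b11]
                       for k in reversed(range(addr_len)))
        fragments.append(addr + seq[start:start + payload_len])
    return fragments


def reassemble(fragments: list[str], addr_len: int = 4) -> str:
    """Sort fragments by address and join their payloads."""
    def addr_value(fragment: str) -> int:
        value = 0
        for base in fragment[:addr_len]:
            value = value * 4 + BASES.index(base)
        return value

    ordered = sorted(fragments, key=addr_value)
    return "".join(f[addr_len:] for f in ordered)


message = b"Hello, DNA storage!"
pool = make_fragments(bytes_to_bases(message))
random.Random(0).shuffle(pool)  # fragments in a tube have no order
recovered = bases_to_bytes(reassemble(pool))
print(recovered == message)  # True
```

The address is what makes the shuffle harmless: fragments in a tube are not kept in any order, so each one has to carry its own position.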

Church’s DNA Code

Church’s group encoded a total of 5.27 megabits of digital data into DNA.⁵ In computer terms, a bit is the most basic unit of data storage: it holds either a zero or a one. The encoded data comprised an HTML-coded draft of a book with 53,426 words, 11 JPEG images, and one JavaScript program! They did all this by breaking the digital data into 54,898 fragments, each a 159-nucleotide (nt) oligonucleotide consisting of a 96-bit data block and a 19-bit address code, flanked on both sides by 22-nt common sequences for amplification (96 + 19 + 2 × 22 = 159).
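
The figures in this paragraph can be checked with a few lines of Python. This is only a sanity check on the numbers quoted above; the variable names are made up for the example. Note that adding bits and nucleotides in one sum only works because this scheme stores one bit per base.

```python
DATA_BITS = 96        # data block per fragment
ADDRESS_BITS = 19     # address code per fragment
PRIMER_NT = 22        # common flanking sequence on each side
N_FRAGMENTS = 54_898

# One bit per base, so bits and nucleotides add up directly.
oligo_length = DATA_BITS + ADDRESS_BITS + 2 * PRIMER_NT
total_data_bits = N_FRAGMENTS * DATA_BITS
address_space = 2 ** ADDRESS_BITS

print(oligo_length)     # 159
print(total_data_bits)  # 5270208, i.e. about 5.27 megabits
print(address_space)    # 524288, more than enough for 54,898 fragments
```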

It is interesting to note that they encoded one bit per base (A or C for zero, G or T for one), although the theoretical maximum is two bits per base. In addition, to guard against read errors introduced during sequencing, they sequenced the samples multiple times and built a consensus call at each base. This highly overlapping coverage kept errors to a minimum.
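
Here is a minimal Python sketch of a one-bit-per-base code and a per-position consensus call. The bit-to-base mapping comes from the description above. How to choose between the two allowed bases is an assumption made for this example (pick whichever differs from the previous base), and the consensus function assumes the reads are already aligned and of equal length. The function names are hypothetical.

```python
from collections import Counter

ZERO_BASES = "AC"   # either base encodes a 0
ONE_BASES = "GT"    # either base encodes a 1
BIT_OF = {"A": "0", "C": "0", "G": "1", "T": "1"}


def encode_bits(bits: str) -> str:
    """Encode a string of '0'/'1' characters, one bit per base."""
    seq = []
    for bit in bits:
        options = ZERO_BASES if bit == "0" else ONE_BASES
        # Assumption for this sketch: avoid repeating the previous base.
        if seq and seq[-1] == options[0]:
            seq.append(options[1])
        else:
            seq.append(options[0])
    return "".join(seq)


def decode_bases(seq: str) -> str:
    """Decode bases back to bits: A or C is 0, G or T is 1."""
    return "".join(BIT_OF[base] for base in seq)


def consensus(reads: list[str]) -> str:
    """Majority vote at each position across aligned, equal-length reads."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*reads))


bits = "0011010000"
dna = encode_bits(bits)
print(dna)  # ACGTAGACAC

# Three reads of the same fragment, one with a single read error.
reads = [dna, dna[:3] + "A" + dna[4:], dna]
print(decode_bases(consensus(reads)) == bits)  # True
```

Because two bases map to each bit value, the encoder has some freedom in which one to write, and that freedom is what lets it avoid long runs of the same base.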

Goldman’s DNA Code

The Goldman group had a different solution to the same problem.⁷ The major difference was that Goldman used a more sophisticated encoding system to ensure that no base is repeated consecutively, since runs of the same base can be a big problem in sequencing and analysis. Instead of using Church’s simple method (A or C for zero, G or T for one), they converted the binary data into base-3 digits (so instead of 0 and 1, they used 0, 1, and 2). Each base-3 digit was then translated into a nucleotide that differs from the one before it, so the resulting DNA never contains the same base twice in a row. Finally, they used overlapping 100-base fragments, each shifted by 25 bases from the previous one, to ensure minimal errors during analysis.
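
The idea of a three-symbol code that never repeats a base can be sketched in Python as follows. This is a simplified illustration, not the group's actual pipeline: each byte is written as six base-3 digits (an assumption chosen for simplicity, since 3^6 = 729 covers all 256 byte values), and each digit selects one of the three nucleotides that differ from the previous one. The starting base "A" and all function names are likewise assumptions.

```python
def bytes_to_trits(data: bytes) -> list[int]:
    """Write each byte as six base-3 digits (simplifying assumption)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            byte, remainder = divmod(byte, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits: list[int]) -> bytes:
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def trits_to_dna(trits: list[int], start: str = "A") -> str:
    """Each trit picks one of the three bases that differ from the last one."""
    seq, prev = [], start
    for trit in trits:
        choices = [b for b in "ACGT" if b != prev]
        prev = choices[trit]
        seq.append(prev)
    return "".join(seq)


def dna_to_trits(seq: str, start: str = "A") -> list[int]:
    """Inverse of trits_to_dna."""
    trits, prev = [], start
    for base in seq:
        choices = [b for b in "ACGT" if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def overlapping_fragments(seq: str, length: int = 100, step: int = 25) -> list[str]:
    """Cut full-length windows, each shifted by `step` bases from the last."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


message = b"Shall I compare thee to a summer's day? Thou art m"  # 50 bytes
dna = trits_to_dna(bytes_to_trits(message))
fragments = overlapping_fragments(dna)

print(len(dna), len(fragments))                       # 300 9
print(all(a != b for a, b in zip(dna, dna[1:])))      # True: no base repeats
print(trits_to_bytes(dna_to_trits(dna)) == message)   # True
```

With a 100-base window and a 25-base shift, every base away from the two ends appears in four different fragments, which provides the redundancy for error checking. This sketch drops any tail that does not fill a whole window.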

They were able to encode five files into a long stretch of DNA sequence including:

154 of Shakespeare’s sonnets,
a 26-second audio clip of Martin Luther King’s famous “I Have a Dream” speech,
a copy of James Watson and Francis Crick’s double helix paper,
a photo of their research institute,
and finally a file on the encoding method of the digital data. Now this is truly amazing!

Future considerations

There are pros and cons to DNA data storage:

Pros

High data density: At the molecular level, the digital storage density is at least a million-fold higher than that of any current technology.

High stability: There are plenty of examples of DNA evidence from decades ago providing clues to the identity of an individual. On a much longer timescale, scientists obtained genetic data from a woolly mammoth that had been buried in frozen tundra for thousands of years.

Easy storage: DNA molecules pack into a tiny volume, providing enormous space savings for digital storage.

Cons

Reagents: A polymerase is required to amplify the DNA before it can be read and decoded. That could mean time and person-hours in a specialized lab.

Synthesis costs: Although the price of DNA synthesis is going down, it can still be expensive to synthesize a digital library using DNA. In addition, encoding the data and synthesizing each fragment is time-consuming.

Sequencing costs: The cost of sequencing is plummeting, but the digital information is still not easily accessible if you do not have a sequencer or sequencing facility nearby, or enough money to fully decode your data.

Coverage issues: As discussed above, achieving 100 percent accuracy requires multiple sequencing reads covering every base. In the future, more clever ways of compressing and encoding data will have to be used to make digital DNA storage a reality.

Long-term storage format: Right now we are still not sure how best to store the DNA. Should we store it wet or dry? Embedded in a matrix or as part of a chip?

Is long-term DNA storage going to become a reality? Well, that depends on a few factors. While this technology exists and offers far greater digital storage density than our current storage solutions, the technology is still in its infancy.