DNA: The perfect backup medium

DNA storage could change the way we store and archive information.

It wasn’t enough for Dr. George Church to help Walter Gilbert develop DNA sequencing 30 years ago, lay the foundations for genomics, create the Personal Genome Project, drive down the cost of sequencing, and start humanity down the road of synthetic biology. No, that wasn’t enough.

He and his team decided to publish an easily understood scientific paper (“Next-generation Information Storage in DNA”) that promises to change the way we store and archive information. While this technology may take years to perfect, it provides a roadmap toward an energy-efficient, archival storage medium with a host of built-in advantages.

The paper demonstrates the feasibility of using DNA as a storage medium with a theoretical capacity of 455 exabytes per gram. (An exabyte is 1 million terabytes.) Now, before you throw away your massive RAID 5 cluster and purchase a series of sequencing machines, know that DNA storage appears to be very high latency. Also know that Church, Yuan Gao, and Sriram Kosuri are not yet writing 455 exabytes of data; they’ve started with the more modest goal of writing Church’s recent book on genomics to a 5.27-megabit “bitstream.” Here’s an excerpt from the paper:
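To get a feel for what 455 exabytes per gram means, here’s a back-of-envelope sanity check. The zettabyte archive is a hypothetical figure chosen for illustration, not something from the paper:

```python
# Back-of-envelope check of the headline density figure.
# Assumption: we take the paper's 455 exabytes/gram at face value
# as a theoretical maximum.
EXABYTE = 10**18                 # bytes
density = 455 * EXABYTE          # bytes per gram of DNA

world_archive = 10**21           # a hypothetical 1-zettabyte archive, in bytes
grams = world_archive / density
print(f"{grams:.2f} g")          # roughly 2.2 grams for a zettabyte
```

At that density, an archive that would fill a warehouse of tape fits in a test tube.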

We converted an html-coded draft of a book that included 53,426 words, 11 JPG images and 1 JavaScript program into a 5.27 megabit bitstream. We then encoded these bits onto 54,898 159nt oligonucleotides (oligos) each encoding a 96-bit data block (96nt), a 19-bit address specifying the location of the data block in the bit stream (19nt), and flanking 22nt common sequences for amplification and sequencing. The oligo library was synthesized by ink-jet printed, high-fidelity DNA microchips. To read the encoded book, we amplified the library by limited-cycle PCR and then sequenced on a single lane of an Illumina HiSeq.

If you know anything about filesystems, this is an amazing paragraph. They’ve essentially defined a new standard for filesystem inodes on DNA: each 96-bit data block carries a 19-bit address. To read the bitstream back, they amplify the library with the polymerase chain reaction (PCR) and then sequence it. This is important because it means that reading the information involves generating millions of copies of the data in a format that has proven durable. This biological “backup system” has replication built in. Not just that, but the replication process comes with billions of years of reliability data behind it.
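The block-plus-address layout from the excerpt can be sketched in a few lines of code. This is a simplified illustration, not the paper’s actual pipeline: the paper maps each bit to one of two possible bases (A or C for 0, G or T for 1) to avoid problem sequences, and adds 22-nt flanking primer sites on each end (96 + 19 + 2 × 22 = 159 nt); here we use a fixed one-base-per-bit mapping and omit the flanks:

```python
# Sketch of the paper's addressing scheme: each oligo carries a 96-bit
# data block plus a 19-bit address, one bit per nucleotide.
# Simplifying assumption: 0 -> 'A', 1 -> 'G' (the paper allows A/C for 0
# and G/T for 1); the 22-nt amplification flanks are omitted.

DATA_BITS = 96
ADDR_BITS = 19

def bits_to_nt(bits):
    """Map each bit to one nucleotide (simplified fixed mapping)."""
    return ''.join('G' if b else 'A' for b in bits)

def encode(bitstream):
    """Split a list of 0/1 ints into addressed 115-nt payloads."""
    oligos = []
    for addr, start in enumerate(range(0, len(bitstream), DATA_BITS)):
        block = bitstream[start:start + DATA_BITS]
        block += [0] * (DATA_BITS - len(block))          # pad the final block
        addr_bits = [(addr >> i) & 1 for i in range(ADDR_BITS - 1, -1, -1)]
        oligos.append(bits_to_nt(block) + bits_to_nt(addr_bits))
    return oligos

def decode(oligos):
    """Reassemble the bitstream by sorting payloads on their address field."""
    nt_to_bit = {'A': 0, 'C': 0, 'G': 1, 'T': 1}
    blocks = {}
    for oligo in oligos:
        bits = [nt_to_bit[c] for c in oligo]
        addr = int(''.join(map(str, bits[DATA_BITS:])), 2)
        blocks[addr] = bits[:DATA_BITS]
    return [b for addr in sorted(blocks) for b in blocks[addr]]
```

Because every oligo carries its own address, the physical “file” needs no ordering at all — sequencing reads come back in arbitrary order and the addresses put them back together, much as inodes let a filesystem reassemble scattered blocks.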

While this technology may be practical only for long-term storage and high-latency archival purposes, you can already see that this paper makes a strong case for the viability of the approach. Of all biological storage media demonstrated so far, this work encodes the longest bit stream, and it is built atop a set of technologies (DNA sequencing) that have focused on repeatability and error correction for decades.

In addition, DNA has advantages over tape and hard drives: a steady-state storage cost of essentially zero (no power is needed to retain data), a lifetime that far exceeds that of magnetic storage, and very small space requirements.

If you have a huge amount of data that needs to be archived, the advantages of DNA as a storage medium (once the technology matures) could quickly translate to significant cost savings. Think about the energy requirements of a data center that needs to store and archive an exabyte of data. Compare that to the cost of maintaining a sequencing lab and a few Petri dishes.

For most of us, this reality is still science fiction, but Church’s work makes it less so every day. Google is uniquely positioned to realize this technology. It has already been established that Google’s founders pay close attention to genomics. They invested an unspecified amount in Church’s Personal Genome Project (PGP) in 2008, and they have invested in a company much closer to home: 23andMe. Google also has a large research arm focused on energy savings and efficiency, with scientists like Urs Hölzle looking for new ways to get more out of the energy that Google spends to run data centers.

If this technology points the way to the future of very high latency, archival storage, I predict that Google will lead the way in implementation. It is the perfect convergence of massive data and genomics, and just the kind of dent that company is trying to make in the universe.
