How DNA could store all the world's data (nature.com)
The real problem isn't storing the data, it's accessing it. There is no way to address DNA; you can only "shotgun sequence" it. In doing so, you get random fragments of around 200 bases (400 bits). You can't get just one such fragment, you get half a billion in one go, currently at a cost of around $5000. (Older, much more expensive technology got up to 1000 bases... sometimes, and only 100 fragments per machine run.) So how are you going to access your archive? By sequencing the whole thing and (temporarily) storing it on a hard drive?
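For a sense of scale, here's a back-of-the-envelope sketch in Python using the figures from that comment (~200-base fragments, ~half a billion fragments per run, ~$5000 per run, 2 bits per base); the constants are rough assumptions taken from the comment, not vendor specs:

    # Rough throughput/cost of one shotgun-sequencing run, using the
    # figures quoted above (assumptions, not vendor specifications).
    BITS_PER_BASE = 2          # A/C/G/T carries at most 2 bits per base
    FRAGMENT_LEN = 200         # bases per fragment
    FRAGMENTS_PER_RUN = 5e8    # ~half a billion fragments per run
    COST_PER_RUN_USD = 5000

    bits_per_run = FRAGMENT_LEN * BITS_PER_BASE * FRAGMENTS_PER_RUN  # 2e11 bits
    gb_per_run = bits_per_run / 8 / 1e9                              # ~25 GB of raw reads
    usd_per_gb_read = COST_PER_RUN_USD / gb_per_run                  # ~$200 just to read a GB
    print(f"~{gb_per_run:.0f} GB raw per run, ~${usd_per_gb_read:.0f}/GB to read")

So even before any redundancy or assembly overhead, reading the archive back costs on the order of $200 per gigabyte.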
The manufacturers of modern sequencers (both Illumina and ABI) have been talking about this for at least 7 years (i.e. as long as they've been selling high-throughput sequencers). They actually made a weaker claim: according to them, it makes no sense to keep a sequenced genome, because just sequencing it again would be cheaper than storing the data. In these 7 years, it hasn't happened. Instead, ABI's SOLiD technology all but vanished. Actually storing data in DNA is one step further; it's not going to happen for a long time.
(Source: My employer does a lot of sequencing. I talked to sales representatives of both companies, and I work on data sequenced using Illumina's machines. We store that data on spinning rust.)
From what I gathered from my own research, the talk around HGP-write, and a few chats with Nick Goldman himself (who is a very funny guy), the main problem is neither storage nor access (access can be improved by probing, and it matters less anyway since the primary application would be archival) but synthesis, which still costs at minimum $1 per 10 bp.
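To see why that synthesis price is the bottleneck, here's a quick Python estimate assuming the quoted floor of $1 per 10 bp and a plain 2-bits-per-base encoding with no redundancy (both are simplifying assumptions, not the actual encoding schemes in use):

    # Write cost per megabyte at ~$1 per 10 bp, i.e. $0.10 per base,
    # assuming 2 bits per base and no error-correction overhead.
    USD_PER_BASE = 1 / 10
    BITS_PER_BASE = 2

    def write_cost_usd(n_bytes: int) -> float:
        bases_needed = n_bytes * 8 / BITS_PER_BASE
        return bases_needed * USD_PER_BASE

    print(f"1 MB: ${write_cost_usd(10**6):,.0f}")  # roughly $400,000 per megabyte

At roughly $400,000 per megabyte written, synthesis dwarfs both the storage and the sequencing costs.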
And sequencing will become even cheaper once you don't have to do it from a library prep but can do it in a controlled buffer environment. It's just not getting cheaper right now because Illumina has no incentive to make it so (similar to Intel's position in CPUs). Let's hope that ONT, BGI and whoever else still hopes to grab some market share (Ion Torrent, PacBio ...) can force them to evolve (project firefly, yeah).
Synthesis is dropping fast, and will drop even faster in the near future. There are a couple of 'humps' in the demand for synthesis, with plateaus in between. Synthesis between 0 and ~200 bp gets you all you need for PCR (copy/paste). But if you can't do ~3000 bp, you can't make a full-sized gene. So people have gotten used to PCRing everything, and there is simply no real demand for anything larger.
But with a few new players on the block (Twist, Gen9, and a few other smaller/newer startups), the goal is to hit an economical ~2-3 kb, at which point the race is back on and whole new markets will open up. The moment that happens, expect competition to kick back in and everyone's prices to drop again.
The size of a moderate plasmid (~5,000-7,000 bp) is another hurdle, and the size of a small chromosome (~100,000 bp) is another.
Also, if you're ordering DNA in pools or in bulk (and have a good compression algorithm), you can get the price per bp to come down even more.
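As a toy illustration of why compression matters before ordering, here's a sketch that maps bytes to bases with a simple 2-bit mapping, with and without zlib compression; the mapping and names are my own, not any vendor's pipeline:

    # Compressing the payload first reduces the number of bases to order.
    import zlib

    BASES = "ACGT"  # 00->A, 01->C, 10->G, 11->T

    def bytes_to_bases(data: bytes) -> str:
        return "".join(BASES[(byte >> shift) & 0b11]
                       for byte in data
                       for shift in (6, 4, 2, 0))

    payload = b"the quick brown fox jumps over the lazy dog " * 100
    plain = bytes_to_bases(payload)
    packed = bytes_to_bases(zlib.compress(payload, 9))
    print(f"{len(plain)} bases uncompressed vs {len(packed)} after compression")

Since synthesis is priced per base, every byte the compressor removes is paid for only once, before it ever reaches the synthesizer.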
There are many ways to address DNA for sequencing that don't involve shotgunning.
I've often thought that if we ever decide to send nano-spaceships filled with engineered DNA to populate other planets like spores, we should include human knowledge in the DNA, so that when the spores grow into an advanced civilization they can read it and learn about their progenitors.
Sounds like the plot of a sci-fi novel: scientists discover that so-called junk DNA contains physics equations, along with what appear to be coordinates to a distant star system with an Earth-like planet.
Take my money.
Star Trek:TNG did an episode that is basically this[0][1].
[0] - http://memory-alpha.wikia.com/wiki/The_Chase_(episode)
[1] - https://en.wikipedia.org/wiki/The_Chase_(Star_Trek:_The_Next...
Pretty cool if there were a SETI equivalent for this on our own DNA.
> The researchers' biggest worry was that DNA synthesis and sequencing made mistakes as often as 1 in every 100 nucleotides. This would render large-scale data storage hopelessly unreliable — unless they could find a workable error-correction scheme. Could they encode bits into base pairs in a way that would allow them to detect and undo the mistakes? “Within the course of an evening,” says Goldman, “we knew that you could.”
How does this work? Are the mistakes consistent enough that we can design encodings that account for them?
FEC (forward error correction) basically adds extra information that can be used to detect and fix errors. A simple scheme is to just duplicate all the info (like a backup), but there are much cleverer schemes that are far more efficient.
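Here's a minimal sketch of the simplest scheme mentioned above (repeat everything and take a majority vote on readback); real codes such as Reed-Solomon or fountain codes give similar protection with far less overhead:

    # Triple-repetition code over bases with majority-vote decoding.
    from collections import Counter

    def encode_repeat(bases: str, copies: int = 3) -> str:
        return "".join(b * copies for b in bases)

    def decode_repeat(encoded: str, copies: int = 3) -> str:
        return "".join(Counter(encoded[i:i + copies]).most_common(1)[0][0]
                       for i in range(0, len(encoded), copies))

    msg = "ACGTACGT"
    noisy = list(encode_repeat(msg))
    noisy[4] = "T"                                 # simulate one bad base
    assert decode_repeat("".join(noisy)) == msg    # majority vote recovers it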
Check the "making memories" infographic midway down the article. One part is that they transcribe the data four times, another is that they only use three of four bases at a certain point. There's probably more checksums though.
Even if you can make it work, DNA stability is poor.
I don't see why you wouldn't use a higher fidelity atomic storage solution.
DNA stability is quite high, to the point where there is actually a movement to get scientists to stop freezing DNA for long-term storage, because it uses large amounts of energy for no reason.
Well, freezing/thawing creates shear forces that destroy the DNA, so I think the reason is different.
You are seriously mistaken. At ambient temperature, DNA is very stable. Even in an environment as active as the human body it stays reasonably intact for about 100 years, and isolated it lasts multiple millennia. That is the reason we can sequence Neanderthals etc. nowadays. It won't be stable for millions of years, but with some redundancy you could easily make it to 100,000.
I'm not sure this is correct. Maybe if you completely isolate it from radiation. The human body is constantly repairing DNA damage.
DNA repair mostly deals with replication errors and damage from normal biological processes, not radiation damage.
Isolated DNA is pretty stable.
Thanks
Do we need to keep the DNA away from bacteria? Would it not be digested for nutrients or food? Or is that just propaganda from the salesman pushing memory carbon for my looongterm data storage needs?
DNase contamination would be a big problem too. It would also be a relatively easy way to "securely" erase your data.
And what about additional layers of redundancy?
DNA is also compressed in a very spectacular way. I wonder if similar compression could be applied to data.
http://thenextweb.com/insider/2016/04/28/microsoft-turning-d...
Microsoft has already started working toward this.