We first heard about Catalog, a pioneer in DNA-based data storage, in October 2020 and did an interview with David Turek, its CTO and an IBM alumnus.
Almost a year later, the company has announced a $35 million Series B round, led by Hanwha Impact Partners, and plans to launch its first chemical-based computing platform, which combines data management (and storage) with computation via synthetic DNA manipulation.
The time was therefore right to catch up with Catalog and put its CEO, Hyunjun Park, in the interviewee seat.
1. So what is the latest on Shannon? What has happened since we last interviewed Dave Turek (the CTO of Catalog)?
Over the past year, CATALOG has worked with several leading IT, Energy, and Media and Entertainment companies on collaborations to help advance the technology for commercialization. Through this work, CATALOG has discovered broad applicability of our platform across industry sectors, as well as nearly universal demand for what DNA-based computing promises among heavy data users. Early applications that we can speak about at the moment include digital signal processing, such as seismic processing in the energy sector, and database comparisons, such as fraud protection and identity management in the financial industry.
2. Right now Shannon is a bit like the ENIAC of its generation: bulky, slow, expensive, limited but groundbreaking. If we were to fast forward to 2030, what would Shannon v10 look like?
Shannon helped prove that the process of automating and scaling DNA-based storage and now DNA-based computation was achievable. For this purpose alone, it was important to build Shannon. As we move a decade out, future versions of the technology will be smaller and more portable, faster, and more efficient. It's certainly conceivable that by 2030 you could see desktop and pocket size versions of Shannon available and using very small amounts of energy for both storage and compute.
3. DNA in computing is usually associated with storing data. Catalog wants to bring DNA into algorithms and applications. But how?
By computing with DNA we mean the transformation of data encoded in DNA into some new kind of information. For example, if I have an input file of two large numbers, multiplying them together creates a number that was not previously present in the file—this is new information which represents the product of the two pieces of data. We believe that we can create a set of chemical “instructions” which can operate on DNA encoded data to create new information. Examples would include problems in optimization (finding the biggest, the smallest, the best of something in finance, logistics, manufacturing), problems in signal processing (applied in areas like seismic processing in the oil and gas industry), and problems in inferencing and machine learning to begin with. The advantage with DNA is that we can perform these operations at extreme levels of parallelism meaning we can apply billions or trillions of compute agents to work collectively to solve the problem at hand. Each of the compute agents (likely composed of a collection of molecules) will be relatively weak as a compute engine, but the opportunity to bring billions or trillions together to work a problem will potentially dramatically reduce time to insight.
Another domain of interest to us is search. We can use chemical instructions to quickly find data objects encoded into DNA independent of the volume of data. This means that as the amount of data we are searching grows, we can employ chemical search techniques whose time to solution will remain more or less invariant. That is not the case in many electronic search applications today, and the reason for the difference is that a DNA store is a collection of molecules floating in a liquid and independent of the kind of physical organization that exists with electronic media: a tape cartridge has to be inspected in serial fashion because that is how it is physically organized (A precedes B which precedes C and so on). In a DNA file the molecules are all jumbled together in a liquid and can be searched directly. This compresses time to insight and reduces cost.
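As a loose software analogy only (it does not model CATALOG's chemistry), the contrast between serial inspection and direct, content-based search can be sketched in Python. Every name below is hypothetical, and the dictionary merely stands in for a pool that a probe can address by content.

```python
def serial_scan(tape, target):
    """Inspect records one by one, as on a tape. Returns (position, steps taken)."""
    steps = 0
    for position, record in enumerate(tape):
        steps += 1
        if record == target:
            return position, steps
    return None, steps


def build_pool(records):
    """Index records by content (a stand-in for molecules a probe can bind to)."""
    return {record: position for position, record in enumerate(records)}


def probe_lookup(pool, target):
    """A single content-based lookup; the work does not grow with pool size."""
    return pool.get(target)


# Hypothetical data set: the target is the last record written.
records = [f"record-{i}" for i in range(100_000)]
tape_position, tape_steps = serial_scan(records, "record-99999")
pool = build_pool(records)
pool_position = probe_lookup(pool, "record-99999")
```

In this sketch the serial scan touches every record before it reaches the last one, while the content-based lookup takes one step however many records the pool holds.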
4. Your funding news also mentions that DNA-based computation is expected in 2022. What does that mean and will it be more widely available?
By next year, CATALOG will demonstrate the value of DNA-based computation through a specific business use case. It will likely show the business value of analyzing data previously sitting in cold storage in one particular industry. Our expectation is that as use cases expand we will allow clients to access our technology via the Web as a service (sometime in 2024); we also contemplate the possibility of building miniature devices capable of executing computation on a customer's premises at some point after that.
5. Right now, a sample of DNA-based storage looks like an orange substance in a test tube. What shape/size will it ultimately take?
DNA-based storage is molecules of DNA floating in a liquid (orange in CATALOG's case because of the composition of the inks we use to encode DNA) or perhaps a pellet of DNA for long-term storage. There is great utility to having the storage be in liquid form because it presents the opportunity to find “records” in the file directly: we can create probes which, once inserted into the file, will find the targeted record or datum directly.
6. I asked Catalog one question last year and it was “how much will it cost?” Do we have an answer right now that we can share? What sort of storage density are we looking at, and what sort of cost per stored PB or TB?
The first commercialization option for DNA storage, followed by DNA-based computation, will likely be delivered as a service. We will announce pricing models a bit closer to the availability of that offering. The objective is to be approximately equal in cost to conventional storage but to express value by virtue of dramatic improvements in areal density (a million times more dense than electronic media), effectively infinite longevity, and the avoidance of technology obsolescence. DNA written today will be readable at any time in the future because DNA does not change: there are no concerns such as firmware, OS, or device upgrades.
7. Right now, what are the biggest obstacles to the rapid development of the storage/computational capabilities of DNA and what’s being done to solve them?
Right now the obstacles are engineering in nature and focus on matters clients have viewed as consistently important with respect to any computational technology: reliability, price performance, availability, consistency and so on. We have a dedicated team of engineers, chemists and computer scientists sorting through each of these issues to create the kind of value metrics clients are accustomed to. This includes miniaturization of the current machine, the expansion of automation covering the entire process, and the design and implementation of software infrastructure and tooling desired by clients.
8. What are the current solutions being looked at to solve the throughput problem (e.g. 10MBps written is only 26TB per month)?
The current throughput attributes of Shannon are meant to help CATALOG better understand limiting impacts of design choices we made on the machine including the implication of scaling the chemistry underlying our encoding and computational models. We can adjust the throughput by changing some of the performance parameters on the current system and this would have an impact of a few orders of magnitude. But we have begun to lay out other design choices that could go quite far beyond even that improvement. For example, the addition of incremental ink jet print heads has an exponential impact on the throughput of the machine. This is just one example of many adjustments or design choices available to us.
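The 26TB-per-month figure quoted in question 8 can be checked with simple arithmetic. The sketch below assumes that "MBps" means megabytes per second, that a month is 30 days, and that units are decimal (1TB = 1,000,000MB). The 1,000x speedup is a purely hypothetical illustration of "a few orders of magnitude", not a CATALOG figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def monthly_volume_tb(rate_mb_per_s, days=30):
    """Data written in `days` days at a sustained rate, in decimal terabytes."""
    megabytes = rate_mb_per_s * SECONDS_PER_DAY * days
    return megabytes / 1_000_000  # assumption: 1 TB = 1,000,000 MB


# Rate quoted in the question: 10 MB/s sustained for a 30-day month.
baseline_tb = monthly_volume_tb(10)

# Hypothetical: three orders of magnitude faster (10,000 MB/s).
scaled_tb = monthly_volume_tb(10 * 1_000)

print(f"{baseline_tb:.2f} TB per month at 10 MB/s")
print(f"{scaled_tb / 1_000:.2f} PB per month at 10,000 MB/s")
```

At 10MB/s this gives 25.92TB per month, which matches the rounded 26TB in the question. Under the hypothetical 1,000x improvement the same calculation gives about 25.92PB per month.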