Dispelling the Myths of the Cloud for the Skeptical Scientist

“The cloud” may be generating a lot of buzz in the next-generation sequencing community, but is it worthy of all the hype?

Lauded for its highly scalable infrastructure, the cloud allows scientists to go from 0 to 60 instantaneously with DNA data analysis and storage, without ever having to invest in and maintain expensive hardware. The future looks bright for the cloud, but can it meet NGS scientists’ lofty expectations?

What exactly is “the cloud”?

According to a national survey, most Americans are confused by cloud computing. Of those who claimed they have never used the cloud, 95% actually do so via online banking and shopping, social networking and photo or music storage. So while the cloud may seem obscure, it’s really just a fancy term for a large off-site datacenter. The function of this datacenter is to provide easy access to virtualized resources (hardware and/or software services). NGS scientists can therefore utilize these resources for the storage and analysis of DNA sequence data.

So what’s all the fuss over the cloud and next-generation sequencing?

The most attractive feature of cloud computing is its elasticity: scientists can add or remove resources instantaneously, depending on their analysis needs. Dealing with large-scale sequence data can be unwieldy, since a sequenced human genome contains around 20 gigabytes of data. Depending on the type of scientific experiment, the amount of sequence data to analyze could vary by hundreds of gigabytes!

The cloud grows with you

The beauty of the cloud lies in its ability to eliminate the up-front commitment to expensive hardware and to the staff needed to maintain it. Since the cloud is elastic, a research lab can start small and increase cloud resources once there is a need. The same is true of storage. A few sequenced genomes can easily add up to terabytes of data. With the cloud, your storage grows with your data and you pay only for what you use. The scalability of the cloud segues into the first myth we’ll explore: cloud computing is too expensive.

Focusing on the Data

Since the cloud’s economic value can be difficult to grasp due to multiple variables, I thought I’d provide a simple example of how a small lab might save money using the cloud. Imagine you have to run an experiment that includes the storage, processing and analysis of five human genome samples. If you decide not to work in the cloud, you need to purchase the hardware required to store and analyze that much data. Not only can installing internal infrastructure be challenging for some of us, but maintenance costs can also add up quickly. Typically, scientists focus on short-term solutions, as they don’t want to invest in additional infrastructure that may sit idle most of the time.

Partner with another lab? Too messy!

Now imagine that six months later you have another experiment to run, but this time you have to process not five, but 20 human genome samples. Your current internal infrastructure may be too limited to process this next set of data. So the question is, should you invest in additional servers that you may not need the following year? Granted, you could possibly partner with a neighboring lab, but balancing a budget across labs can get messy. Honestly, there’s no way around it: managing the storage and computing infrastructure yourself creates a great deal of work, and you haven’t even started analyzing your data yet!

Using the cloud for exome analysis

I spoke with Chaim Jalas, Co-Director of Patient Services at Bonei Olam, whose lab contracts out exome sequencing but does the data analysis itself. Initially, the lab analyzed exomes on its single internal server, but found that analyzing two exomes at the same time was too much of a burden. To solve this problem (without having to buy more servers and hire someone to maintain the infrastructure), his lab took advantage of a cloud-based data analysis and management tool. For Jalas’ lab, investing in a data analysis tool that runs in the cloud was the simplest solution. It allowed them to run as many samples as they liked while eliminating the need to purchase additional servers and hire someone to maintain them. “Although we never did a direct cost comparison, utilizing the cloud got us back to focusing on the data. That alone is a huge economic factor,” Jalas said.

Can I trust the cloud?

The idea of handing over sensitive data to another company seems to be a huge source of worry for most scientists. Is my data safe? Who has access to my data? Do I have control over my data? These are all legitimate questions. Arguably, for a cloud provider like Amazon or Microsoft, security is a core competency, and its success depends directly on the safety of its customers’ data.

Like the cloud provider, a cloud-based software provider should ensure security as well as compliance with privacy regulations such as HIPAA. When it comes to privacy, people are actually the greater security risk, not technology. Of reported HIPAA breaches, roughly 75% resulted from theft or loss of a laptop or other digital media.

If it’s good enough for NASA…

Take Amazon Web Services (AWS), the leading cloud provider in the world, which is used by a number of high-profile organizations for a variety of functions. NASA uses AWS for its Mars Rover activity planning software, the US Treasury hosts its flagship website with it, and NASDAQ stores and analyzes its financial trading data in the cloud. In most cases, cloud providers offer greater security than organizations can provide on their own.

Can you stream data in real-time?

A good Internet connection has a bandwidth of 10 megabits per second. Under these conditions there is no bottleneck, especially if the sequence data is uploaded to the cloud directly from the sequencing instrument. Previously, for Illumina instruments, scientists had to wait until the end of the sequencing run for a set of files in FASTQ format to be generated before upload and processing. Sending these large files over the network en masse can tie up your bandwidth for hours. Nowadays, you can prevent this by using Illumina Real-Time Analysis, which produces a set of smaller files in the BCL format after each cycle while the run is still ongoing. Streaming these smaller pieces of data directly to the cloud during sequencing has less impact on the network, freeing up bandwidth for other tasks. Once the sequencing run is complete, you can simply reassemble the FASTQ files from the BCL files in the cloud, and you’re ready to analyze. Not only have you gained time, because you did not wait for the run to complete, but you have also simplified your data processing.
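
To get a feel for why a bulk upload can tie up a connection, here is a rough Python sketch. It borrows the HiSeq 2000 throughput quoted later in this article (25 gigabases per day) and treats each base as one byte. The eight-day run length is purely hypothetical, chosen for illustration, and real file sizes depend on format and compression.

```python
# Rough estimate of how long a bulk upload occupies the link.
# All figures below are illustrative assumptions, not measurements.
LINK_MEGABITS_PER_SECOND = 10   # the "good" connection described above
GIGABYTES_PER_DAY = 25          # assumption: ~1 byte per base at 25 gigabases/day
RUN_DAYS = 8                    # hypothetical run length, for illustration only


def upload_hours(gigabytes, link_mbps):
    """Hours needed to push `gigabytes` of data through a `link_mbps` link."""
    bits = gigabytes * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600


run_gigabytes = GIGABYTES_PER_DAY * RUN_DAYS
bulk_hours = upload_hours(run_gigabytes, LINK_MEGABITS_PER_SECOND)
print(f"Bulk upload of {run_gigabytes} GB: about {bulk_hours:.0f} hours at full link speed")
```

Under these assumptions, the whole upload happens after the run has finished and saturates the link while it does, whereas streaming spreads the same data over the days the instrument is already running.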

Here’s some math to put this into perspective:
-Illumina quotes the throughput of the HiSeq 2000 at 25 gigabases per day, or about one gigabase per hour.
-Assuming roughly one byte per base, a HiSeq 2000 produces about one gigabyte of sequence data per hour, or 290 kilobytes per second.
-If you operate at 100% efficiency, 290 kilobytes/second multiplied by eight bits/byte (for network transmission) translates into about 2.3 megabits per second of data output.
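
The same arithmetic can be sketched in a few lines of Python. The one-byte-per-base conversion is an assumption implied by the figures above rather than an Illumina specification, and the function name is just for illustration.

```python
# Sketch of the bandwidth arithmetic above.
GIGABASES_PER_DAY = 25          # HiSeq 2000 throughput quoted above
BYTES_PER_BASE = 1              # assumption: one byte per sequenced base
SECONDS_PER_DAY = 24 * 60 * 60


def required_megabits_per_second(gigabases_per_day, bytes_per_base=BYTES_PER_BASE):
    """Sustained upload rate needed to keep pace with the instrument."""
    bytes_per_second = gigabases_per_day * 1e9 * bytes_per_base / SECONDS_PER_DAY
    return bytes_per_second * 8 / 1e6   # eight bits per byte, then bits to megabits


rate = required_megabits_per_second(GIGABASES_PER_DAY)
print(f"Sustained rate: {rate:.2f} Mbit/s")

# Share of a five or ten megabit link that one instrument would occupy.
for link_mbps in (5, 10):
    print(f"{link_mbps} Mbit/s link: {rate / link_mbps:.0%} utilized")
```

Changing `GIGABASES_PER_DAY` or `BYTES_PER_BASE` shows how sensitive the estimate is to the instrument and to the assumed data size per base.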

Ultimately, the particular mix of instruments and the level of network congestion will affect the required bandwidth. However, a five-megabit-per-second connection would more than adequately support one HiSeq 2000, allowing sequence data to be uploaded to the cloud at the same rate that the sample is being sequenced.

Still not convinced?

To put this into context, people who stream movies over the Internet to their homes do so at a higher bit rate! So if you are a small sequencing core with one or two HiSeq 2000s, a 10-megabit connection can easily keep up with your sequencing output.

The cloud is your friend

My bottom-line advice: the cloud is an NGS scientist’s friend. Even though cost savings may be difficult to quantify, in the end it gets you back to data analysis. Trusted cloud providers are like security watchdogs. Security is their core competency, which makes them the safest option for your data. Bandwidth is no longer a barrier, since sequence data can be uploaded to the cloud in real time.
