Introduction to Cloud Computing for the Analysis of Large Human Datasets

Mar 07, 2017

Biomedical research is increasingly data-intensive. Technological innovation in recent years has led to an exponential increase in the throughput of assays like DNA sequencing while driving costs down so that a large percentage of funded investigators can utilize them. An oft-cited statistic is that the cost to sequence a human genome has dropped from close to half a billion dollars in 2001, when the Human Genome Project released its working draft, to less than a thousand dollars today. Although there is some disagreement about the exact numbers, the order of magnitude is not in question. Omics technologies are increasingly ubiquitous in the biomedical sciences, particularly in basic and pre-clinical research. Other sources of large or complex data include imaging, electronic health records, and mobile sensor devices. As precision medicine advances, we can expect to see investigators increasingly utilize these technologies in clinical trials.

Large or complex data often requires significant computing infrastructure. The analysis may require compute clusters or expensive high performance compute nodes. Although terabyte-sized hard drives are now relatively inexpensive, the cost to store significant genomic datasets can be substantial when factoring in the costs of server and network infrastructure, backups, and administration. The increasing scale of omics and other technologies coupled with the rapid capital depreciation of computer infrastructure ensures that significant compute and storage costs will be ongoing. Thus, building and operating a scientific computing resource for an organization can be an expensive proposition.

Major public cloud computing platforms like Amazon Web Services, Microsoft Azure, IBM Cloud, and Google Cloud Platform have become extremely popular in recent years. Through economies of scale and mechanisms enabling clients to provision and decommission a variety of virtual compute, storage, and networking resources on demand, these cloud platforms are able to provide an alternative to the traditional data center that can be significantly less expensive. In general, cloud platforms like these provide two categories of service: infrastructure as a service (IaaS) and platform as a service (PaaS). IaaS includes basic compute resources like servers, data storage, and networks. A person conducting a large data analysis may “spin up” a cluster of virtual servers, install the needed software, run the analysis, decommission the servers, and pay only for the time that the servers were in operation.

In a traditional data center, the administrators must perform careful capacity planning to ensure there are sufficient resources to support demand. In shared computing environments like a traditional compute grid, administrators must also provide resource allocation mechanisms and schedulers to allow analyses from multiple users to run. When capacity is relatively large, analysis jobs can start sooner and execute faster. However, larger capacity means larger cost, and the infrastructure will be underutilized during periods of low demand. When capacity is relatively small, analysis jobs will wait longer before starting and take longer to execute.

IaaS eliminates this need for capacity planning as resources can be created and decommissioned on demand. It also allows analysis to be conducted more rapidly as essentially unlimited compute resources are available. Additionally, all major cloud vendors enable developers to create “machine images,” which allow users to spin up new virtual servers that come with preinstalled software, libraries, and reference data enabling the user to avoid the often difficult task of setting up an analysis environment manually.
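The capacity trade-off can be made concrete with a small simulation. The following is a minimal sketch, not a model of any real scheduler: the function names and the hourly demand figures are hypothetical, and work is measured in abstract node-hours rather than dollars, since actual prices vary by vendor.

```python
def fixed_cluster_summary(demand, capacity):
    """Simulate a fixed-size cluster hour by hour.

    demand   -- node-hours of new work arriving in each hour (hypothetical data)
    capacity -- node-hours the cluster can complete per hour
    Returns (utilization, max_backlog): the fraction of paid-for capacity that
    did useful work, and the largest amount of work left waiting in any hour.
    """
    if capacity <= 0 or not demand:
        raise ValueError("capacity must be positive and demand non-empty")
    backlog = 0
    used = 0
    max_backlog = 0
    for arriving in demand:
        work = backlog + arriving      # queued work plus new work
        done = min(work, capacity)     # the cluster cannot exceed its capacity
        used += done
        backlog = work - done          # unfinished work waits for the next hour
        max_backlog = max(max_backlog, backlog)
    utilization = used / (capacity * len(demand))
    return utilization, max_backlog


def on_demand_node_hours(demand):
    """With on-demand provisioning, only the node-hours actually used are billed."""
    return sum(demand)


if __name__ == "__main__":
    demand = [2, 2, 20, 2, 2, 2]  # one burst of work in an otherwise quiet period
    for capacity in (4, 20):
        utilization, backlog = fixed_cluster_summary(demand, capacity)
        paid = capacity * len(demand)
        print(f"fixed capacity {capacity}: paid for {paid} node-hours, "
              f"utilization {utilization:.0%}, peak backlog {backlog}")
    print(f"on demand: paid for {on_demand_node_hours(demand)} node-hours")
```

In this made-up scenario, the small cluster stays busy but leaves work queued after the burst, while the large cluster clears every hour's work but sits mostly idle. Provisioning on demand pays only for the 30 node-hours of work that actually arrive.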

PaaS provides higher level services that may even abstract away infrastructure so that users do not need to concern themselves with servers and other basic resources. Many PaaS services, such as web servers, are primarily of interest to software developers. However, with the rise in the use of predictive analytics in industries like retail, cloud vendors are providing analysis services that may be utilized by data analysts in the biomedical sciences. For instance, all of the public cloud vendors listed above provide managed Hadoop services. Hadoop is a popular distributed computing platform where large analysis jobs can be distributed across numerous compute nodes and run in parallel. Setting up Hadoop clusters can be complex. PaaS services eliminate this complexity so that analysts can instantiate and decommission clusters at will.
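The split-and-combine idea behind this kind of distributed analysis can be illustrated in a few lines. This is a toy sketch of the general map and reduce pattern only, not the Hadoop API: a local thread pool stands in for the compute nodes, and the function names and sample reads are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(reads):
    """Map step: count the bases in one chunk of sequencing reads."""
    counts = Counter()
    for read in reads:
        counts.update(read)  # a string is iterated character by character
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def count_bases(chunks, workers=4):
    """Process each chunk independently, then merge the partial counts.

    The thread pool is only a stand-in for a cluster; a real distributed
    framework would ship each chunk to a separate machine.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    chunks = [["ACGT", "AACC"], ["GGTT"], ["A"]]  # made-up reads
    print(count_bases(chunks))
```

Because each chunk is processed independently, adding more workers (or nodes) shortens the map step, and only the small partial results need to be brought together at the end.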

Typically, only technically adept data analysts and technology professionals work directly with IaaS and PaaS services. However, a variety of third parties have developed software as a service (SaaS) tools and platforms that leverage cloud-based IaaS and PaaS services to provide an increased ease of use that allows a wider range of researchers to benefit from the power and economy of the cloud. For example, companies like DNAnexus and Seven Bridges provide user-friendly tools running in Amazon Web Services for the management and analysis of large omics data. Additionally, at least one major public cloud vendor provides a SaaS service for genomic analysis: Google Cloud Platform includes a service called Google Genomics that currently provides tools for the management and analysis of Next Generation Sequencing (NGS) reads, variants, and annotations using high-performance technologies Google developed as part of its core search technology.

For investigators using the cloud to analyze large data sets, such as NGS data, transfer of data from its origin to the cloud can present a significant bottleneck. Studies involving NGS can produce terabytes of raw data. File transfer protocol (FTP) and secure FTP (SFTP) are standard and widely used mechanisms for transferring data across the internet. However, these protocols may not be able to transfer data in the desired timeframe over typical networks. Aspera is a commercial alternative offered by IBM that can significantly expedite the transfer of large data sets to the cloud over the internet. For the largest datasets, transfer by disk is often faster than electronic transfer. Several of the leading public cloud vendors have services for transferring data using hard drives or appliances. For example, Amazon provides an appliance called a “Snowball” that can be used to ship tens of terabytes of data per appliance through a courier.
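A back-of-the-envelope calculation shows why network transfer becomes the bottleneck. The sketch below is an estimate under stated assumptions: the function name is hypothetical, and the default efficiency of 0.8 is an assumed allowance for protocol overhead and competing traffic, not a measured figure.

```python
def transfer_time_hours(size_tb, bandwidth_mbps, efficiency=0.8):
    """Estimate the hours needed to move size_tb terabytes over a network link.

    size_tb        -- data size in terabytes (1 TB taken as 10**12 bytes)
    bandwidth_mbps -- nominal link speed in megabits per second
    efficiency     -- assumed fraction of nominal bandwidth actually achieved
    """
    if size_tb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, bandwidth, or efficiency")
    total_bits = size_tb * 1e12 * 8
    effective_bits_per_second = bandwidth_mbps * 1e6 * efficiency
    return total_bits / effective_bits_per_second / 3600


if __name__ == "__main__":
    for mbps in (100, 1000):
        hours = transfer_time_hours(10, mbps)
        print(f"10 TB over {mbps} Mbps: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, 10 TB takes roughly 278 hours (more than 11 days) over a 100 Mbps link and about 28 hours over a 1 Gbps link, which is why shipping physical media can win for the largest datasets.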

Security of protected health information (PHI) must be addressed the same way in the cloud as in a traditional data center. The leading public cloud vendors provide security mechanisms that equal or surpass those utilized in traditional data centers. However, the individual or organization using the cloud must configure those mechanisms properly. Major cloud vendors provide documented best practices for securing systems and data. To help bring clarity, the US Department of Health and Human Services issued guidelines on the use of cloud computing by HIPAA covered entities and business associates. Under these guidelines, organizations can transmit and store PHI in the cloud. However, they must enter into a business associate agreement (BAA) with the vendor. Major vendors offer standard BAAs as well as documentation on achieving HIPAA compliance in their environments. For investigators conducting genomic data analysis in the cloud, the National Institutes of Health has issued guidelines specifying best practices for security. In the European Union, the use of human data falls under the Data Protection Directive of 1995, which will be supplanted by the General Data Protection Regulation in 2018. Major cloud vendors likewise provide documentation and assistance in complying with European regulation. In other regions, assistance from major vendors is also available.

The economics and power of the cloud will drive its increasing adoption by organizations for the foreseeable future. Data analysts, software developers, and IT professionals will need to become proficient in utilizing the services provided by cloud vendors to manage, analyze, and secure data. Organizations will need to understand the implications and become comfortable storing and analyzing data on hardware not under their roof or under their direct control. Numerous organizations manage and analyze human clinical and large omics datasets in the cloud. Although many may view the cloud with some degree of trepidation initially, this is a path that is becoming well-worn.

By David Hall, Ph.D., Senior Research Scientist at Rho
