Baylor College of Medicine, Amazon Web Services, and DNAnexus team up to run the largest-ever cloud-based analysis of genomic data from over 14,000 patients
The Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium aims to advance our understanding of human genetics and how it contributes to heart disease and aging. The discoveries that CHARGE makes will be instrumental in understanding disease and aging in mechanistic detail, enabling the development of new medical interventions and analysis tools.
According to the World Health Organization, heart disease and stroke kill some 17 million people a year, almost one-third of all deaths globally. By 2020, heart disease and stroke will become the leading cause of both death and disability worldwide, with the number of fatalities projected to increase to over 20 million a year and by 2030 to over 24 million a year.
As one of five institutions participating in the global CHARGE Consortium, The Human Genome Sequencing Center at Baylor College of Medicine (HGSC) found that their local compute and storage infrastructure could not handle both their existing requirements and the massive additional analysis load generated by the CHARGE project.
Their options included quadrupling their current compute core capacity for this short-term project or "jamming the cluster" for 3-4 weeks in an attempt to complete the job.
To address this challenge, the HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra-large scale genomic analysis project.
DNAnexus provided the platform-as-a-service (PaaS) on top of AWS infrastructure, enabling HGSC and 300 investigators to upload, analyze and collaborate on nearly 1PB of data quickly and flexibly, with zero capital investment.
As part of its participation in the CHARGE Consortium, HGSC utilized the DNAnexus platform to analyze the genomes of over 14,000 individuals, encompassing 3,751 whole genomes and 10,940 whole exomes, using their Mercury pipeline.
Mercury is a modular set of semi-automated tools for the analysis of next-generation sequencing data, focused on the delivery of annotated SNP and indel variants. Both the pipeline's output from the analysis of the CHARGE data and the analysis tools themselves were then made available to over 300 researchers, who are aggregating the data of five large cohort studies around the world.
This project represents the largest genomic analysis performed in the cloud: 300 researchers across five collaborating institutions around the world, 3.3M core-hours of computation, and nearly 1PB generated in 4 weeks.
Over a four-week period, approximately 3.3 million core-hours of computational time were used, generating 430TB of results, with nearly 1PB of data hosted for further analysis. The job was completed 5.7x faster and more reliably than would have been possible on the on-premises cluster.
At the project's peak, HGSC was able to spin up 20,800 cores on-demand in order to run the analysis pipeline on the CHARGE data. During this period, HGSC was running one of the largest genomics analysis clusters in the world, without any capital investment or sacrifice of their local cluster capacity.
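To put the peak figure in context, the quoted numbers can be turned into a rough average level of concurrency. The short Python sketch below is a back-of-the-envelope check only: it assumes the "four-week period" means exactly 28 days of wall-clock time and that the 3.3 million core-hours were spread over that whole window. Neither assumption is stated in the case study, and the variable names are illustrative.

```python
# Back-of-the-envelope check using only figures quoted in the case study.
CORE_HOURS = 3_300_000   # approximate core-hours consumed (from the text)
PEAK_CORES = 20_800      # cores running at the project's peak (from the text)
WEEKS = 4                # ASSUMPTION: "four-week period" = exactly 28 days

# Total wall-clock hours in the assumed 28-day window.
wall_clock_hours = WEEKS * 7 * 24

# Average number of cores busy if the work were spread evenly over the window.
average_cores = CORE_HOURS / wall_clock_hours

# How far the peak allocation exceeded that average.
peak_to_average = PEAK_CORES / average_cores

print(f"wall-clock hours:     {wall_clock_hours}")
print(f"average cores in use: {average_cores:,.0f}")
print(f"peak / average:       {peak_to_average:.1f}x")
```

Under these assumptions the average works out to roughly 4,900 cores, so the 20,800-core peak was about four times the average load. That kind of bursty demand is what on-demand capacity is meant to absorb.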
CHARGE is one of the largest scientific consortia to have a direct impact on human health.
Like the 1000 Genomes Project and ENCODE, the CHARGE project looks at a massive amount of genomic data from different locations globally.
However, the CHARGE project goes a step further to include the examination of phenotypic data to create a statistical model that can find genuine associations with novel genetic loci.
The HGSC Mercury Pipeline is a semi-automated and modular set of tools for the analysis of NGS data in clinically focused studies. HGSC designed the pipeline to identify mutations in a person’s genomic data and determine whether any of them are serious, disease-causing mutations.
The CHARGE project works at a population-wide level. In short, the scientific question it addresses is: which genes and mutations are associated with good and bad health conditions in a large population? For example, in an article indexed in PubMed (Nature Genetics (2013), doi:10.1038/ng.2671), CHARGE demonstrates the use of whole genome sequence information to find genes associated with high levels of high-density lipoprotein, which has been associated with better cardiovascular health, and provides tools that allow other investigators to perform similar analyses on other conditions.