Pilot project to boost data sharing, accessibility
A team from the Broad Institute, the University of California, Berkeley, and the University of California, Santa Cruz, was awarded one of three National Cancer Institute (NCI) Cancer Genomics Cloud Pilot contracts. The goal is to build a system that enables large-scale analysis of The Cancer Genome Atlas (TCGA) and other datasets by co-locating the data and the required computing resources in one cloud environment. This co-location will allow researchers across institutions to bring their analytical tools and methods to the data in an efficient, cost-effective manner, promoting democratized access and collaboration across the cancer genomics community. Seven Bridges Genomics, Inc. and the Institute for Systems Biology, in collaboration with Google, are the two other awardees.
Large-scale sequencing efforts are helping researchers understand the genetic changes that lead to cancer and have led to the development of several successful, targeted chemotherapies. These developments show that identifying mutations that drive cancer can translate into therapeutics. However, three main challenges remain. First, processing massive sequence datasets requires costly computational infrastructure for which few groups have the resources; those that do have the resources often end up duplicating each other's engineering and analysis efforts. Second, data generation is outpacing the development of tools and methods that can be used on such large datasets: petabytes of data already exist, and exabytes (1,000 times a petabyte) are to come. Finally, data is being collected and stored in silos, limiting the potential for synergy, data sharing, and integrated analysis. To put a petabyte in perspective: if the average MP3 encoding of music requires around 1 MB per minute, and the average song lasts about four minutes, then a petabyte of songs would last over 2,000 years playing continuously.
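The petabyte comparison can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes a 365.25-day year and a constant rate of 1 MB per minute of audio. It also shows that the result depends on the unit convention: with binary units (1 PB = 2^50 bytes, 1 MB = 2^20 bytes) the figure comes to a little over 2,000 years, while with decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) it comes to roughly 1,900 years.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 60 * 24 * 365.25


def playback_years(total_bytes, bytes_per_minute):
    """Years of continuous playback for audio stored at a fixed byte rate."""
    return total_bytes / bytes_per_minute / MINUTES_PER_YEAR


# Binary convention (assumption): 1 PB = 2**50 bytes, 1 MB = 2**20 bytes.
binary_years = playback_years(2**50, 2**20)

# Decimal convention (assumption): 1 PB = 10**15 bytes, 1 MB = 10**6 bytes.
decimal_years = playback_years(10**15, 10**6)

print(f"binary units:  {binary_years:,.0f} years")
print(f"decimal units: {decimal_years:,.0f} years")
```

The four-minute song length does not affect the total; it only determines how many songs fill the petabyte (about 250 million at 4 MB each, under the decimal convention).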
The impetus for the cancer genomics cloud pilots grew from an inquiry the NCI posed in April 2013, asking its grantee community to describe their most frequent computational challenges. From these responses, six general themes emerged: data access, computing capacity and infrastructure, data interoperability, training, usability, and governance. The Broad-University of California Cloud Pilot (BUCCP) is addressing these gaps in cancer genome analysis by building a platform for data aggregation and analysis on a computing cloud. The platform will combine a production environment for running analyses, robust security and access control, and a scalable paradigm for distributed data storage and computation. The BUCCP system will host TCGA data and will be pre-populated with commonly used computational tools to immediately empower the cancer genomics research and biomedical communities. In addition, the team will develop strategies to engage the community and demonstrate the capabilities of the platform.
Gad Getz, Ph.D., of the Broad Institute, is the lead principal investigator of the BUCCP and will be leading the Broad team together with Matthew Trunnell and Anthony Philipakis, M.D., Ph.D.
"The Cancer Genomics Cloud Pilots will allow the cancer research community to collaborate in a way that has not been possible before," said Getz. "We'll now be able to share data and tools and jointly learn from the totality of cancer genomics data. Our cloud system will democratize access to computational tools for non-experts as well as empower developers with a platform for creating the next generation of analytical methods."
The BUCCP will build on the Broad Institute's experience as a leader of TCGA analysis through Firehose, a data analysis and management system developed at Broad. The project will also draw on the experience of the UC Santa Cruz team, led by David Haussler, Ph.D., in building and operating the NCI's Cancer Genomics Hub, and on the long-standing efforts of the UC Berkeley team, led by David Patterson, Ph.D., a pioneer in distributed computing, to develop tools for efficient computing over genomics data.
This effort is firmly rooted in the data-sharing principles set forth by the Global Alliance for Genomics and Health (GA4GH), of which Haussler, Patterson, Getz, and Philipakis are working group members, making it both technically driven and mission driven from its inception. The pilot awardees will collaborate with each other and with the NCI Genomic Data Commons (GDC) at the University of Chicago, where the data will be hosted, as well as with NCI staff and leadership, toward a shared vision of a cohesive data and analysis infrastructure to advance the understanding and treatment of cancer.