Center for Big Data in Translational Genomics

The Center for Big Data in Translational Genomics is a multi-institution partnership coordinated by UC Santa Cruz to create scalable infrastructure for the broad application of genomics in biomedicine. This multinational collaboration between academia and industry will create data models and analysis tools for massive genomic data sets. Such tools can be used to analyze genomes and gene expression data from thousands of individuals to uncover the contribution of gene variants to disease, with an initial focus on cancer. This knowledge will be instrumental in developing precision diagnostic and treatment methods.

At least half of all diseases have a substantial genomic component, often including contributions from millions of individually rare but collectively common genetic variations. Only by studying the genomes and transcriptomes of very large numbers of individuals will scientists have the statistical power to discover and understand this vital aspect of the genomic contribution to disease. To achieve this, genomics must be brought into the big data era, making possible the analysis of huge data sets and the wide deployment of precision diagnosis and treatment based on genomic information.

The center will make software solutions interoperable by developing standard application programming interfaces (APIs) and tools at multiple levels, from raw sequence data to genetic variation and functional data, through to systems, pathways, and phenotypes. The overriding goal is to create implementations capable of handling genomics data sets that are orders of magnitude larger than those that can be handled today. The APIs and all academic reference implementations will be open source, while several major corporate partners not funded by the project will provide proprietary implementations, creating a competitive ecosystem of interoperable big data genomics software.

All implementations developed by UC Santa Cruz and our partners will undergo extensive benchmarking, open to all comers, to identify the best of breed, and the results will be made broadly available. A diverse set of separately funded biomedical projects will serve as pilots, and their needs will help drive the design. These projects include the pan-cancer, whole-genome analysis project of the International Cancer Genome Consortium (analyzing 2,000 cancer genomes); the UK10K project (analyzing 10,000 personal genomes); the UCSF-led I-SPY2 adaptive breast cancer trial; and BeatAML, the omics-guided leukemia therapy project at Oregon Health & Science University.

UC Santa Cruz coordinates this program as one of the NIH Big Data to Knowledge centers. Our U.S. partners include UC San Francisco, UC Berkeley, Oregon Health & Science University, Caltech, and several major big data companies. International partners include the European Bioinformatics Institute, the Sanger Centre, the Ontario Institute for Cancer Research, and a computer systems provider.

