Alliance Computational Genomics and Bioinformatics (CGB) Unit
By Ann L. Oberg, PhD
Associate Group Statistician
Director, Alliance Computational Genomics and Bioinformatics


The Computational Genomics and Bioinformatics (CGB) Unit, previously known as the Bioinformatics Unit, is part of the Alliance Statistics and Data Management Center (SDC). This unit serves as a hybrid ‘hub’ for bioinformatics activities, providing assistance with study design, conduct, analysis, interpretation and publication of results. Within the Alliance, ‘bioinformatics data’ is understood to encompass genomic data from high-dimensional assays for which the number of markers considerably exceeds the number of patients. Data types commonly utilized include those generated by Next Generation Sequencing and microarray assays. Study goals range from genome-wide association studies (GWAS) or hypothesis testing with candidate gene lists to unbiased discovery studies, and may utilize patients from a single trial or multiple trials. Faculty members include Drs. Karla Ballman, Liguo Wang, Nicholas (Nick) Larson, Oleksandr (Alex) Savenkov, and me. Staff members include Keith Anderson, Travis Dockter, Gregory Jenkins, and Shaun Riska.

Under the direction of the Alliance Translational Research Program, the unit has two current initiatives. The first is working out the process for submitting sequencing data to dbGaP, a public data repository whose goal is to distribute data and results from studies of the associations between human DNA genotype and phenotype information. The second is establishing the bioinformatics data storage pilot, affectionately known as the DataMart. I provide an overview and update on both initiatives below.

The NIH expects researchers to share large-scale genomic data collected through NIH-funded mechanisms in order to foster further research and ensure economical use of resources. In addition, most journals require that these data be deposited before a manuscript is published. To fulfill these expectations, the data may be deposited into dbGaP. Genotype information can include information at the single-base level, such as single nucleotide polymorphisms (SNPs) or mutations. Alternatively, it may describe larger portions of the genome, such as copy number variation (CNV), in which sections of DNA are repeated multiple times and the number of repeats varies between people. Phenotype information refers to characteristics of the person, such as development of cancer, ability to metabolize a drug, or experiencing an adverse event in response to treatment. The process of sharing such data is highly regulated because it is plausible that some of this information, taken together, could be used to identify a person. Over the past nine months, we have been working to understand the process for obtaining the IRB certification that allows data to be deposited. The many institutions participating in Alliance studies, the number of IRBs involved in approving the research, and the evolution of patient consent all contribute to the complexity of the certification process. The steps and required documents are now defined, and three trials have completed the certification review process. Once all required documents are obtained, the certification review process takes six to eight weeks.

The Alliance SDC has a high-functioning clinical database. A long-term goal is to store the bioinformatics data as well, and to make it easily sharable among Alliance researchers. To meet these goals, many Alliance members have helped to plan specifications for what we call the DataMart. A pilot version of the DataMart focusing on Next Generation Sequencing data is now operational. Data will be stored in dbGaP format. We will begin loading the first dataset into the DataMart in June. Usage metrics will be collected over the next two to three years and used to plan a larger database.

We are excited to grow our collaborations within the Alliance, and we encourage you to reach out to me or to any member of the CGB Unit with questions.

