NCIBI 2009 ARM Posters and Abstracts

1) NIH National Center for Integrative Biomedical Informatics (NCIBI)

James Cavalcoli, H.V. Jagadish, David States, Gilbert Omenn, Daniel Kiskis, Barbara Mirel, and Brian Athey

The National Center for Integrative Biomedical Informatics (NCIBI) is developing the framework of conceptual models, computational infrastructure and an integrated knowledge repository, which modern scientists need to make effective use of the wealth of data flowing from molecular biology and translational research. The NCIBI provides researchers with web-accessible knowledge analysis, collaborative work environments to create and utilize computationally-enabled models, and workflows to better understand complex biomedical processes.

DRIVING BIOLOGICAL PROJECTS

A1) Genomic-Scale Screening for Gene Fusions in Human Solid Tumors by Integrative Biomedical Informatics

Xiaosong Wang, Saravana M. Dhanasekaran, John R. Presener, Bo Han, Nallasivam Palanisamy, Maureen A. Sartor, Gilbert S. Omenn, and Arul M. Chinnaiyan

The recent discovery of recurrent gene fusions in prostate and lung cancers has stimulated broad interest in searching for causal gene fusions in solid tumors and has inspired drug development. However, because the basic principles of chromosome rearrangement are poorly understood, identifying causal gene fusions against the background of non-specific chromosomal aberrations in solid tumor genomes remains challenging. In this study, we applied an integrative biomedical informatics approach to predict novel gene fusions on the basis of high-throughput biological data.

Through integrative database mining, we analyzed the shared characteristics of established gene fusions in cancer [see http://portal.ncibi.org for tools and databases]. We analyzed the exon expression aberrations associated with translocations; dissected the domain architectures of known fusion proteins; identified the molecular concept signatures of fusion genes through enrichment analysis; and analyzed the unbalanced DNA breakpoints that result in gene fusions using our compendia of high-resolution array CGH data.

We identified the tumor-specific promoters that are frequently found in promoter-type fusions; developed an innovative M-Score scheme to quantify the functional relevance of human genes in cancer causation; and deduced a general principle describing the characteristics of a special group of unbalanced fusions. Based on multiple lines of evidence characterizing the fusion genes, we created an integrative translational bioinformatics model for the prediction of novel gene fusions in human solid tumors (Figure 1). Several comprehensive strategies were developed to address the possibility of recurrent fusions based on the available evidence in different cancer types. We applied this approach to predict gene fusions based on public genomic, sequence and functional data, as well as our deep sequencing data.

The computational results were validated by regular or quantitative reverse transcription PCR (qRT-PCR), rapid amplification of cDNA ends (RACE), and fluorescence in situ hybridization (FISH). The integrative approach proved to be efficient: between 20% and 50% of candidates were validated, depending on the assay.

This work demonstrated that our integrative approach can significantly increase the yield and accuracy of prediction of causal gene fusions. This study lays a bioinformatics foundation for the discovery of gene fusions from high-throughput biological data. (This work was supported by the National Institutes of Health, Grant #U54 DA021519.)

A2) Identification of Novel Splice Variants in Mouse Models for Breast Cancer

Raji Menon, David J. States, and Gilbert S. Omenn

Alternative splicing plays a major role in protein diversity without significantly increasing genome size. Aberrations in alternative splice variants are known to contribute to a number of diseases. The several alternative splice databases now publicly available differ in their annotation and modeling methods and contain many transcripts not present in reference resources like Ensembl or Refseq. The ECgene database is one of the largest alternative splice variant databases [Kim P, et al. Genome Research 2005; 15(4): 566-76]. In this study of potential biomarkers for breast cancer, we have used mass spectrometric data to interrogate a custom-built, non-redundant database created with three-frame translations of mRNA sequences from ECgene and Ensembl to find alternative splice variants. The mzXML files from LC-MS/MS analyses of tumor and normal mammary tissue from a HER2/Neu-driven mouse model of breast cancer [Whiteaker et al, JPR 2007; 6 (10), 3962-3975] were downloaded from PeptideAtlas [ http://www.peptideatlas.org/repository/]. These files were searched against the database using X!Tandem software. We identified 3898 distinct peptides with X!Tandem expect score < 0.001 at a false discovery rate < 1.7%. The peptides were analyzed using NCBI blastp and UCSC blatp. We found 7 novel peptides that occurred only in tumor samples; these peptides did not match completely to any known mouse protein sequence and were identified by more than one spectrum. Six of these peptides either matched to intronic sequences of known genes or partially matched to known protein sequences. For example, ‘RGQKPPAMPQPVPTA’, a novel peptide identified by 3 distinct spectra, had five amino acids missing when compared with the known peptide sequence ‘QKGGKPEPPAMPQPVPTA’ from ribosomal protein S3. The known peptide has a functional motif ‘GGKPEPP’ that is involved in protein-protein interaction mediated by SH3 domains. This motif is missing in the novel peptide we found. 
We found a novel peptide from the intronic region of the Rogdi gene that had a phosphopeptide motif which interacts directly, with low affinity, with the BRCT (carboxy-terminal) domain of the breast cancer gene BRCA1. Another novel peptide identified by 3 distinct spectra did not match any known protein sequence. However, its sequence matched Mus musculus chromosome 7, clone RP23-49M22, by NCBI tfblast, and the peptide sequence had signal peptide and transmembrane regions when analyzed with the EBI software InterProScan. These data suggest that alternative splice variants play functional roles in tumor mechanisms and are potentially rich sources of candidate biomarkers. More detailed analyses of these proteins are under way.
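
The three-frame translation step used to build the search database can be illustrated with a minimal sketch. It assumes the standard genetic code and forward-strand translation only; the function names are hypothetical and this is not the pipeline used in the study.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide sequence codon by codon ('*' = stop, 'X' = unknown)."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translations(mrna):
    """Return the translations of the three forward reading frames."""
    return [translate(mrna[offset:]) for offset in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frame_translations("ATGGCCTAA"), start=1):
        print(frame, protein)
```

Searching spectra against all three frames allows peptides from transcripts whose reading frame is not annotated to be matched.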

Acknowledgements: Supported by NCI/SAIC 23XS110A on Mouse Models of Human Cancers, MTTC GR 687 for Proteomics Alliance for Cancer Research, U54 DA021519 National Center for Integrative Biomedical Informatics, and P41 RR018627 National Resource for Pathways and Proteomics.

A3) Transcriptional Profiling of Type 2 Diabetic Mice Reveals Changes in Fat Metabolism

Timothy D. Wiggin, Junguk Hur, Matthias Kretzler, and Eva L. Feldman

Approximately 16 million Americans have been diagnosed with Type 2 diabetes and approximately 3 new cases are diagnosed every minute. Diabetic neuropathy (DN) is a serious complication of diabetes that results in loss of sensation in the limbs and cardiac complications, and it is the leading cause of non-traumatic amputations in the United States. The BLKS-db/db mouse model develops severe Type 2 diabetes, and has symptoms of severe DN by 24 weeks of diabetes. We observe a significant increase in the concentration of oxidized lipid in both dorsal root ganglia (DRG) and the sciatic nerve (SCN) in the diabetic mice.

To identify the mechanisms of DN in these mice, we transcriptionally profiled DRG and SCN from mice following 24 weeks of diabetes and from age-matched controls. The NCIBI GenePattern pipeline was used to identify the significantly regulated genes in each tissue. In total, 2505 genes were significantly regulated in the SCN and 1419 in the DRG, but only a small fraction of these genes were co-regulated in both tissues. Chinese Restaurant Clustering was used to find clusters of co-regulated genes, and clusters enriched for mitochondrial genes were isolated in each tissue. These clusters of co-regulated genes with similar function were analyzed for conserved promoter elements. A three-element transcriptional module of two SP1 binding sites and a CTCF site was found that is shared between the tissues and conserved across species.

One of the regulated genes with this three-element module, Acsl1, is highly relevant to lipid metabolism. The expression of this gene in human disease has been confirmed by comparison with a human neuropathy microarray dataset. Because of its possible role in lipid-mediated DN, it was targeted for biological confirmation of the bioinformatics findings. We confirmed the increase in gene expression by PCR and Western blot analysis. The promoter of the gene was confirmed to be functional by a luciferase assay.

A4) Identification of Conserved Regulatory Network of Diabetic Neuropathy and Nephropathy

Junguk Hur, Viji Nair, Tim Wiggin, Matthias Kretzler, Frank Brosius, and Eva Feldman

Diabetic neuropathy (DPN) and nephropathy (DN), which result in significant mortality, morbidity and poor quality of life, are frequent complications in patients with diabetes mellitus (DM). To find improved intervention strategies, it is critical to understand comprehensively the molecular mechanisms and gene regulatory networks involved in disease progression. To obtain more insight into the processes leading to these complications, gene expression profiles of human sural nerve and kidney tissues from diabetes patients were surveyed using DNA microarrays. A total of 4,680 and 4,630 genes were found to be differentially regulated in nerve (between progressive and non-progressive DPN) and kidney (between albuminuric and non-albuminuric DN), respectively. Cross-tissue comparison of transcriptional networks using the suboptimal graph matching tool TALE from the NCIBI suite of tools allowed the identification of a core network of 91 genes shared by both tissues. This shared network includes well-studied diabetes-related genes such as peroxisome proliferator-activated receptor gamma (PPARG) and leptin receptor (LEPR). Identification of these key genes supports the validity of the current approach of pursuing a shared transcriptional network. This conserved gene network is expected to lead to a better understanding of disease progression and will serve as a starting point for defining therapeutic strategies that target microvascular complications of DM independent of organ manifestation.

A5) Uncovering Genetic Factors Contributing to Type 2 Diabetes and Diabetic Nephropathy

Yongsheng Bai, Viji Nair, Sebastian Martini, James Cavalcoli, Gilbert Omenn, and Matthias Kretzler

Recent genome-wide association studies (GWAS) (The Wellcome Trust Case Control Consortium 2007; Saxena, Voight et al. 2007; Scott, Mohlke et al. 2007; Zeggini, Scott et al. 2008) have made major discoveries in identifying Type 2 diabetes (T2D) associated regions and loci, but the specific sequence variants responsible for the associations remain elusive.

To define putative causative gene sets from GWAS, we employed a promoter modeling approach based on the hypothesis that the promoter region integrates upstream signaling cascades into coordinated transcription of functionally interdependent mRNAs. Defining T2D-dependent promoter models in GWAS candidate promoters might thereby facilitate identification of putative causative transcript alterations.

Here we studied the proximal promoter regions of 13 genes selected from T2D-associated regions in the 3-way FUSION-DGI-WTCCC meta-analysis. In particular, we used computational methods to identify shared putative regulatory promoter modules in these proximal promoter regions. Specific potential regulatory promoter modules containing three transcription factor (TF) binding motifs in a defined order and spacing were identified in a subset of genes chosen from GWAS-associated regions. These promoter modules helped identify other module-sharing genes in the GWAS regions, which are possibly regulated in a similar fashion.
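
The idea of a promoter module (binding motifs in a defined order and spacing) can be sketched as a simple check over predicted binding sites. This is an illustration under stated assumptions, not the modeling software used in the study: the TF names, positions, and the single `max_gap` spacing rule are hypothetical.

```python
def has_module(sites, module, max_gap):
    """Check whether a promoter contains the TF sites of `module` in order.

    sites   -- list of (position, tf_name) predicted binding sites (hypothetical input)
    module  -- ordered tuple of TF names, e.g. ("TF_A", "TF_B", "TF_C")
    max_gap -- maximum allowed distance between consecutive module sites
    """
    ordered = sorted(sites)

    def search(start_index, last_pos, remaining):
        if not remaining:
            return True  # every element of the module has been placed
        for idx in range(start_index, len(ordered)):
            pos, tf = ordered[idx]
            spacing_ok = last_pos is None or 0 < pos - last_pos <= max_gap
            if tf == remaining[0] and spacing_ok:
                if search(idx + 1, pos, remaining[1:]):
                    return True
        return False

    return search(0, None, tuple(module))


if __name__ == "__main__":
    promoter = [(10, "TF_A"), (40, "TF_B"), (75, "TF_C"), (90, "TF_A")]
    print(has_module(promoter, ("TF_A", "TF_B", "TF_C"), max_gap=50))
```

Scanning other promoters with the same module definition is one way to find additional module-sharing genes.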

Our study provides data on TF binding modules that may activate a subset of T2D GWAS genes.

A6) Complex Genetic Influences on Comorbid Bipolar Disorder with Tobacco Use Disorder.

Richard C. McEachin, Nancy L. Saccone, Scott F. Saccone, Laura J. Bierut, and Melvin G. McInnis

Comorbidity of psychiatric and substance use disorders represents a significant complication in the clinical course of both disorders. Bipolar Disorder (BD) is a psychiatric disorder that has significant negative effects on the lives of those affected, including a high rate of comorbid Tobacco Use Disorder (TUD). For patients with BD, the risk for TUD is almost 4 times that for the general population (Relative Risk 3.83, 95% Confidence Interval 3.55 to 4.14), based on our meta-analysis of the seven studies that have been published to date. Notably, the observed bi-directional increased relative risk for both disorders is consistent with some common underlying etiology for BD and TUD. Given this potential for common etiology, as well as evidence of genetic influences on both BD and TUD, we hypothesized a common underlying genetic etiology, interacting with environmental nicotine exposure, influencing both BD and TUD. We use multiple NCIBI and outside resources to establish candidate genes for comorbid BD with TUD, then explore these candidates as well as additional candidates, based on their relationships with the established candidates, in the comorbidity. The resulting gene networks show significant association with both BD and TUD, reveal novel inference on this comorbidity, and establish new candidate genes for follow-up testing. This work is funded, in part, by the NCIBI’s Building Bridges Postdoc fellowship.

A7) A Time Course Study of Genome Transcription Regulation in Lymphoblastoid Cells in Response to Lithium Treatment

Haiming Chen, Alan Prossin, Nulang Wang, Margit Burmeister, and Melvin G. McInnis

Lithium (Li) is effective in the treatment of Bipolar Disorder (BD). It is well established that Li interacts with and inhibits GSK3B and IMPases to modulate the Wnt and phosphoinositol signaling pathways. However, much remains unknown with regard to downstream gene expression changes affected by the regulation of these pathways. To gain insight into the regulation of transcription activity, we directly examined genome-wide gene expression profiles in lymphoblastoid cell lines (LCLs) with and without Li treatment for periods of 4, 8, and 16 days. We identified 2789, 892, and 1577 gene transcripts at the three respective time points (false discovery rate, FDR < 0.05) that were differentially expressed in treated LCLs compared to untreated LCLs. A total of 43 transcripts changed expression patterns in treated cells across all three time points. Using the two-class-paired time course algorithms implemented in SAM (Significance Analysis of Microarrays), we identified 218 transcripts that showed significant slope changes over time (FDR < 0.05). Of the 218 significant transcripts, only C8orf33 showed a positive slope change, and the rest showed negative slope changes. C8orf33 is a novel gene mapped to the region of 8q24 linked to BD. We focused on the analysis of C8orf33 using the Michigan Molecular Interaction database (MiMI) search algorithms, and found that C8orf33 directly interacts with three other genes (GIT1, GPRASP1, and HAP1). These four genes form a network with a total of 144 nodes (genes) and 345 edges. Among the 144 nodes, 21% were reported to be regulated by Li in mouse brain (McQuillin et al., 2007). Functional annotation using the EASE algorithms suggests that the 144 genes in the network are enriched in biological pathways that may be relevant to the mechanism of Li’s therapeutic action in BD.
Notably, the G-protein coupled receptor protein signaling pathway (P = 2.27E-8), neuroactive ligand-receptor interaction (P = 7.38E-16), the calcium signaling pathway (P = 7.90E-10), and regulation of the actin cytoskeleton (P = 3.16E-05) are prominent in this network. Our data suggest that a large number of genes change expression patterns in response to Li treatment; however, the magnitude of the changes is modest. The results reported here suggest that Li-responsive genes in LCLs may be involved in gene interaction networks that could have predictive potential for Li treatment outcomes in Bipolar Disorder.

A8) HC Prechter Bipolar Repository: Preliminary Report and Analyses.

Scott A. Langenecker, Mohsen Almani, Christine B. Brucksch, Steven M. Brunwasser, Mary Clark, E. Garcia, Iva Grasso, Gloria Harrington, Allison Kade, Masoud Kamali, Laura Phelps, Lisa A. O’Donnell, Stephanie Prechter, Alan Prossin, Michael T. Ransom, Christine R. Grimm, Erika F.H. Saunders, Aaron C. Vederman, and Melvin G. McInnis

The Prechter BP Repository was established in 2005 to contain extensive clinical and biological samples, data and results from a cohort of BP individuals and unaffected controls. Currently over 400 individuals are enrolled; there are 232 with BPI disorder, 46 with BPII or BP NOS, and 65 unaffected control subjects. Extensive clinical, environmental, and neuropsychological data are gathered from all participants, including regular follow-up evaluations.

Diagnosis is conducted via the structured Diagnostic Interview for Genetic Studies, followed by the Best-Estimate diagnostic process. In a preliminary cross-sectional analysis of healthy control subjects (HC) and euthymic (E), depressed (D), or hypomanic/mixed (HM) patients with bipolar disorder (BD), we identified intermediate cognitive phenotypes (ICP) from assessments in executive functioning, attention, memory, fine motor function, and emotion processing. These were from eight domains consistent with previous literature: auditory memory, visual memory, processing speed with interference resolution, verbal fluency and processing speed, conceptual reasoning and set-shifting, inhibitory control, emotion processing, and fine motor dexterity. Intermediate phenotypes for personality factors from the NEO-PI were Extraversion, Neuroticism, Conscientiousness, Agreeableness, and Openness to Experience.

NCIBI CORE SUITE OF TOOLS

B1) Enabling GPU Computing in the R Statistical Environment

Josh Buckner, Manhong Dai, Brian Athey, Stanley Watson and Fan Meng

R is the most popular open source statistical environment in the biomedical research community. However, most popular R function implementations involve no parallelism, and for large data-parallel analysis tasks they can only be executed as separate instances on multicore or cluster hardware. The arrival of modern graphics processing units (GPUs) with user-friendly programming tools such as CUDA provides the possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude. Nonetheless, most R users are not trained to program a GPU, which is a key obstacle to the widespread adoption of GPUs in biomedical research.

To overcome this obstacle, we decided to port the R functions we use most frequently in our work to the GPU using the Nvidia CUDA platform. In the ideal solution, if a CUDA-compatible GPU and driver are present on a user’s machine, the user need only prefix "gpu" to the original function name to take advantage of the GPU implementation of the corresponding R function. Achieving this ideal is one of our primary goals. Our solution aims to enable a biomedical researcher to harness the computational power of a GPU using a familiar tool. Since our code is open source, researchers may customize the R interfaces to their particular needs.

Using Nvidia’s CUDA toolkit, we have implemented a variety of statistical analysis functions with R interfaces that execute with some degree of parallelism on a GPU. The functions provided by our package include GPU-enabled calculations of distances between sets of points (similar to the R dist function), hierarchical clustering (similar to the R hclust function), the Granger causality test, and a matrix multiplication method. In our experience, these GPU implementations can show a 3- to 75-fold increase in computing performance over a quad-core Nehalem 920 processor running four threads simultaneously.
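
To show why a distance calculation suits a GPU, the following is a plain CPU sketch of a pairwise Euclidean distance matrix, the kind of result the R dist function produces. It is not the package's code, and the function name is hypothetical. Each (i, j) entry depends only on rows i and j, so the entries can be computed independently, which is the property a GPU implementation would exploit.

```python
import math


def pairwise_distances(points):
    """Return the symmetric matrix of Euclidean distances between all points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is independent of every other pair (data-parallel work).
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            dist[i][j] = dist[j][i] = d
    return dist


if __name__ == "__main__":
    for row in pairwise_distances([(0, 0), (3, 4), (6, 8)]):
        print(row)
```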

B2) A Cytoscape Plugin for Subgraph Matching Using SAGA/TALE

Alex Ade, Yuanyuan Tian, and Jignesh Patel

SAGA, the Substructure Index-based Approximate Graph Alignment tool, and TALE, a Tool for Approximate Subgraph Matching of Large Queries Efficiently, allow users to match query graphs against a large database of graphs. The biological application of SAGA/TALE allows users to query and compare biological pathways against the KEGG pathway database. Here we describe a Cytoscape plugin that sends query graphs to SAGA/TALE and retrieves the approximate matching graphs. Cytoscape, an open source platform for visualizing molecular networks, is an ideal input and display framework for SAGA/TALE.

B3) ConceptGen: A gene set enrichment and concept mapping tool

Vasudeva Mahavisno, Zach Wright, Alla Karnovsky, Gilbert S. Omenn, Brian Athey, James Cavalcoli, and Maureen A. Sartor

Identification of biological concepts enriched in an experimentally-derived gene list has become an integral part of the analysis and interpretation of genomic data. Of additional importance is the ability to explore networks of relationships among previously defined biological concepts from diverse information sources. We will present for the first time ConceptGen, a gene set enrichment and concept mapping tool that integrates gene sets from 13 biological knowledge sources totaling ~20,000 concepts and provides a user-friendly web interface. The experimentally-derived concepts include several hundred from public microarray datasets downloaded from Gene Expression Omnibus (GEO), which we analyzed using a custom-built gene expression analysis pipeline incorporating advanced statistical methods and quality control checks. Additional concept types include Gene Ontologies, pathway databases, protein domain families, miRNA target sets, drug target sets, gene-centered protein interaction sets, MeSH-derived concepts, and metabolite-specific gene sets created using published human metabolic networks that link compounds and reactions to enzymes and genes. ConceptGen can easily be expanded to include experimental data from additional technologies, such as ChIP-Seq, RNA-Seq, or high-throughput metabolomics and proteomics.

Using a modified Fisher’s Exact Test, we pre-computed the significance of overlap among all concepts, and developed a state-of-the-art user interface with Flex technology. Visualizations include a network and heat-map view of significantly enriched concepts. Our novel "gene enrichment analysis" flips genes and concepts to provide a visualization of the relationship network of queried genes based on significant overlap of their concept signatures. Users are provided with private accounts for uploading gene or metabolomic datasets. We demonstrate the usefulness of ConceptGen using a bipolar disorder case study. ConceptGen is part of the freely-available suite of bioinformatics tools from the NIH National Center for Integrative Biomedical Informatics (NCIBI, http://ncibi.org).
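
The significance of overlap between two concepts can be illustrated with the standard one-sided Fisher's exact test, computed from the hypergeometric distribution. This is a sketch only: the abstract refers to a modified test whose details are not given here, so the unmodified version below is an assumption, as are the function and parameter names.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, overlap):
    """One-sided Fisher's exact test for gene set overlap.

    Returns the probability of observing at least `overlap` shared genes when
    a set of `size_b` genes is drawn at random from `universe` genes, of which
    `size_a` belong to the other set.
    """
    max_overlap = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 genes, two sets of 100 and 150 genes sharing 10.
    print(overlap_p_value(20000, 100, 150, 10))
```

Because the concepts are fixed, such p-values can be computed once for every pair of concepts and stored, which matches the pre-computation described above.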

B4) Cross Web Application Integration through a Shared Database

Weijian Xuan, Terry Weymouth, Glenn Tarcea, Alex Ade, Zach Wright, Jim Cavalcoli, Barbara Mirel, Hosagrahar V. Jagadish and Fan Meng

A large number of web applications are designed to deal with different aspects of biomedical data analysis requirements. A comprehensive data analysis task often requires functions from multiple applications, and enabling different web applications to work seamlessly with each other is a major challenge. While there are different ways to achieve various levels of cross-application integration, we chose to use a common database for sharing intermediate data from multiple applications.

Our solution offers a number of unique advantages, including 1) a central location for archiving intermediate data and for project management, including sharing of intermediate results among pre-defined group members; 2) shared functions for manipulating intermediate data, such as gene/protein ID mapping from different applications and the union, intersection, and subtraction of different data lists; and 3) the need for each application to communicate with only a single API and database, rather than implementing an application-specific solution for each new application that needs integration.
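
The shared list operations mentioned above (union, intersection, subtraction) can be sketched with ordinary set operations. The function name, the operation keywords, and the example identifiers are hypothetical and do not describe the actual API.

```python
def combine_lists(list_a, list_b, operation):
    """Combine two saved identifier lists; returns a sorted list without duplicates."""
    a, b = set(list_a), set(list_b)
    results = {
        "union": a | b,       # identifiers in either list
        "intersect": a & b,   # identifiers in both lists
        "subtract": a - b,    # identifiers in list_a but not in list_b
    }
    if operation not in results:
        raise ValueError(f"unknown operation: {operation}")
    return sorted(results[operation])


if __name__ == "__main__":
    from_tool_1 = ["TP53", "BRCA1", "EGFR"]  # illustrative gene symbols
    from_tool_2 = ["BRCA1", "KRAS"]
    for op in ("union", "intersect", "subtract"):
        print(op, combine_lists(from_tool_1, from_tool_2, op))
```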

We have created a web-based API for controlled access to the server that supports these functions: save a data set, list all saved data sets, and review the content of a data set. NCIBI is developing a common set of core services for external and internal integration of tools. We also have a demo session for a workflow that involves Gene2MeSH, PubAnatomy, and the MiMI database developed at NCIBI.

B5) Bayesian Consensus Pathway Construction and Expansion using Microarray Gene Expression Data

Andrew P. Hodges, Zuoshuang Xiang, Peter Woolf, and Yongqun He

Signaling and regulatory pathways that guide gene expression have only been partially defined for most organisms. However, given the increasing number of microarray measurements, it may be possible to reconstruct such pathways and uncover missing connections directly from experimental data. Using a compendium of microarray gene expression data obtained from E. coli, we constructed a series of Bayesian network models for the reactive oxygen species (ROS) pathway as defined by EcoCyc. Three consensus Bayesian network models (large, medium, and small) were generated based on consensus stringency. While less stringent consensus models diverged from the literature model, more stringent consensus models with fewer genes better approximated the known ROS pathway. Networks at the three consensus levels were expanded to predict genes that enhance the Bayesian network model using an algorithm termed ‘BN+1’. Expansion of each of the three ROS-based networks predicted many stress-related genes and their possible interactions with other ROS pathway genes. For example, BN+1 expansion of the large network predicted a potential important role for uspE in regulating the ROS pathway and biofilm stress responses. The medium network expansion identified several genes (e.g., sra and yodD) and their possible interactions with other genes in the ROS pathway. The majority of known acid fitness island genes were recovered within the top 10 predicted genes by expansion of the small network containing gadE, gadW and gadX. The presently reported consensus and BN+1 expansion method is a generalized approach applicable to the study of other biological pathways and living systems.
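
One way to read "consensus stringency" is as the fraction of candidate networks in which an edge must appear. The sketch below follows that reading; the abstract does not specify how consensus was computed, so this rule, the function name, and the placeholder gene labels are all assumptions.

```python
from collections import Counter


def consensus_edges(networks, min_fraction):
    """Return the edges present in at least `min_fraction` of the networks.

    networks     -- list of edge collections; an edge is a (parent, child) tuple
    min_fraction -- consensus stringency between 0 and 1
    """
    counts = Counter(edge for net in networks for edge in set(net))
    cutoff = min_fraction * len(networks)
    return {edge for edge, count in counts.items() if count >= cutoff}


if __name__ == "__main__":
    # Placeholder networks, not results from the study.
    candidate_networks = [
        {("A", "B"), ("C", "D")},
        {("A", "B")},
        {("A", "B"), ("C", "D"), ("E", "F")},
    ]
    for stringency in (0.3, 0.6, 1.0):
        print(stringency, sorted(consensus_edges(candidate_networks, stringency)))
```

Under this rule, a higher stringency keeps fewer edges and therefore fewer genes, which is consistent with the smaller networks described for the more stringent consensus models.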

B6) Visualizing and Analyzing Metabolomics Data Using MetScape Plugin for Cytoscape

Jing Gao, Alla Karnovsky, Glenn Tarcea, Christopher Beecher, Charles Burant, Terry Weymouth, James Cavalcoli, Barbara Mirel, Brian Athey, Gilbert Omenn, and H.V. Jagadish

MetScape is a Cytoscape <http://cytoscape.org/> plugin for visualizing and analyzing metabolomic data. It uses data from the Edinburgh Human Metabolic Network reconstruction (Ma et al., Mol Syst Biol. 2007; 3: 135). It allows users to visualize networks of compounds and display related information about reactions, enzymes, and pathways. Using a modified DynamicXpr plugin, a user can upload experimental metabolomic data and visualize the network over time or across conditions through changing node color and/or node size. MetScape allows users to display an entire metabolic network or pathway-specific sub-networks. Users can apply a pathway filter to a network and create sub-networks from the resulting subsets of compounds. In the future, we will integrate MiMI <http://mimi.ncibi.org/>, a database of genes, gene products, interactions, and pathways, to show gene-related information.

B7) PubAnatomy: Integrated Exploration of Biomedical Literature and Data in the Context of Mouse Brain Anatomy

Weijian Xuan, Manhong Dai, Josh Buckner, Barbara Mirel, Jean Song, Hongwei Dong, Brian Athey, Huda Akil, Stanley Watson and Fan Meng

Effective literature and data exploration requires the support of extensive background information at each decision point. Existing literature and data search solutions are designed mainly for the retrieval of the most relevant information and do not adequately support the iterative exploration of information across multiple perspectives for scientific discovery. We believe that combining rich background information with interactive visualization of literature and data in their relevant biological context can greatly facilitate the connection of various pieces of information for new hypothesis development. PubAnatomy provides new ways to explore relationships among anatomical structures, pathophysiological processes, gene expression levels and protein-protein interactions by presenting Medline literature and experimental data in the context of mouse brain anatomy, gene network and anatomy ontology. It is an Adobe Flex 3-based Rich Internet Application that can be easily extended for third party data and analysis functions. The URL for PubAnatomy is: http://brainarray.mbni.med.umich.edu/Brainarray/prototype/PubAnatomy.

B8) Customization and Integration of GenePattern Microarray Workflows into a Collaborative Environment

Felix Eichinger and Beth Kirschner

In a collaborative effort, the Kretzler Lab, the NCIBI developer team and MCIT are establishing a framework for effective and collaborative analysis of microarrays. To attain this goal, the workflow for microarray analyses, developed and refined over the years by the Kretzler lab, has been implemented in GenePattern. This ensures that the two main goals for a scientific workflow can be met:

1) Persistence, resulting in improved efficiency, reproducibility and the possibility to effectively communicate the current stage of the work between different members of the lab or to collaborators.

2) Flexibility, in adapting to the current data set while staying within a global analysis strategy, and the capability to improve and extend the original workflow as new tools and techniques emerge.

With GenePattern it is possible to fulfill both of these requirements: the support for pipelines enhances effectiveness and reproducibility, while the modular architecture enabled the MCIT team to adapt existing analysis modules and implement new ones to meet our needs. In parallel, MCIT and the NCIBI developer team are joining forces to integrate GenePattern into the NCIBI portal system, creating the basis for effective collaboration through integrated and group-specific sharing of files and results.

B9) Qunits: Queried Units in Database Search

Arnab Nandi and H.V. Jagadish

When searching for information, a biologist is not expected to have the knowledge of a complex structured query language, or be familiar with the internal organization of the searched biological database. Every biologist has a mental map of the database, but often finds it hard to express or extract the information they are looking for. While the database contains records for "proteins" and "links", the biologist may be looking for "genomic information" or "pathways".

To solve this problem, we model the database as a collection of "queried units" or "Qunits". Qunits are atomic pieces of information that can be returned as a direct response to a biologist’s query. We describe methods for the automated generation of Qunits from query workloads and external evidence, and demonstrate that Qunits provide better quality of search results compared to traditional database search methods.

B10) Network Based Gene Set Analysis

George Michailidis

In this work, we consider the problem of assessing differential expression of entire gene sets in complex biological experiments. We propose a latent variable model that directly incorporates the underlying biological network structure. Subsequently, using the theory of mixed linear models, we develop the necessary inference framework for addressing the task at hand. Several test procedures are examined, and a network-based method is presented for testing changes in the expression levels of gene sets as well as in the structure of the network. The performance of the proposed methodology is assessed through a simulation study, and the methodology is applied to a number of real data sets.

B11) MimiWeb: Bringing Biological Data to the Web

V. Glenn Tarcea, Terry Weymouth, Zach Wright, Aaron Bookvich, Barbara Mirel, and H.V. Jagadish

Michigan Molecular Interactions (MiMI) allows users to easily explore an extensive database of protein interactions, pathways and genes. The data in MiMI comes from multiple external and internal data sources including DIP, BIND, and NCIBI’s NLP literature mining efforts.

MimiWeb has recently been updated to include additional biological information, including more pathways data and metabolomics data (compounds, reactions and pathways). Based on user feedback, the MimiWeb interface was streamlined and updated to be more user-friendly. MimiWeb is a front-end website to a number of biologically important databases at NCIBI including MiMI (proteins, interactions, pathways), HumDB (metabolomics), Gene (genes), and PubMed (literature citations). MimiWeb interfaces and integrates with other members of the NCIBI suite of tools including GIN-IE and GIN-NA, MiMI Plugin for Cytoscape, MetScape, and others.

B12) Integrative Data Services And New Applications

Terry Weymouth, V. Glenn Tarcea, Weijian Xuan, Alex Ade, Zach Wright, Aaron Bookvich, Alla Karnovsky, Fan Meng,
Barbara Mirel, and H.V. Jagadish

NCIBI has been developing a common framework for tool integration. This framework brings together the NCIBI suite of tools by allowing users to seamlessly move their data across applications. NCIBI is developing other core services on top of this framework.

The Integrated Search and Filter Service brings search and filtering services to the large biological databases at NCIBI. The service will allow users to easily search across all the NCIBI data, and to filter based on current search context. Users will be able to easily find items and relationships of interest. As a part of this new service NCIBI is developing the IBIS front-end, an integrated search tool to provide a launching pad into the other NCIBI tools.

NCIBI has been working with i2b2 and their best-of-breed Hive clinical applications to integrate our two offerings. Currently, NCIBI and i2b2 have integrated the NCIBI TagMapper service into Hive. This service brings ICD9-to-gene mapping capabilities into the Hive client interface and provides a gateway for researchers using i2b2 to couple to the gene-based tools of NCIBI. Using these features, users can find genes of interest starting with disease classifications. NCIBI and i2b2 are committed to expanding their partnership and continuing efforts to bring together our respective capabilities.

The Application Data Sharing Service (as described in the poster "Cross Web Application Integration through a Shared Database") allows users of NCIBI tools to save, upload and move data across applications. Users can pick an entry point that best serves their current need and then easily move their findings into other NCIBI tools. Currently, Gene2Mesh and MimiWeb have been adapted to use this service. The new tools using this service are PubOnto, PubAnatomy, IBIS and the new Hive TagMapper plugin.

B13) Bioinformatics Framework for the Analysis and Interpretation of Metabolomic Data

Alla Karnovsky, Jing Gao, Glenn Tarcea, Christopher Beecher, Charles Burant, Barbara Mirel, H.V. Jagadish, and Gilbert S. Omenn

Metabolomics is a rapidly emerging field that is joining other high-throughput "omics", such as proteomics and transcriptional profiling. It promises to be a powerful systems approach for studying metabolic profiles pertinent to a variety of normal and disease states. Transcriptional profiling and proteomics have established data analysis tools; a metabolomics analytical toolkit is yet to be developed. We are creating a set of tools that will allow the user to examine experimental metabolomic data in the context of human metabolic networks and to combine it with other high throughput data.

A number of public sources contain information about human metabolic networks consisting of compounds, chemical reactions, pathways, enzymes and genes (KEGG, BIGG, EHMN). In this project, we used KEGG and EHMN data to trace the connections between metabolites and genes. Compounds, reactions, enzymes, genes and the relationships between them provide an initial framework for the analysis of metabolomic data. The Michigan Molecular Interactions database (MiMI) developed by the National Center for Integrative Biomedical Informatics (NCIBI) integrates protein interactions data from a number of public sources and thus can supply broader context for the analysis of the experimental data.

The data are stored locally in a Microsoft SQL Server database. They can be accessed via web-based query interfaces and Metscape, a new Cytoscape plug-in we are refining. The web tools allow the user to search the database for genes, reactions and pathways associated with a compound or gene. The web interface is linked to Cytoscape to build and display the network, using several customized layouts. Once the network has been displayed in Cytoscape, the user can import the normalized experimental metabolite data. Several data points can be loaded; the color and shape of the nodes are used to represent different parameters (Fig 1). The plug-in allows users to cycle through the data points sequentially, while the underlying network remains unchanged. The initial network is based on human metabolic reactions. It can be expanded to include protein-protein interactions data from the MiMI database.

B14) Design of Integrated Translational Bioinformatics Systems

Barbara Mirel, Benjamin J. Keller, Mor Peleg, and Russ Altman

Bioinformatics tools built for isolated tasks do not adequately support translational researchers’ analytical needs for hypothesis formulation. We address the design of integrated systems to support exploratory translational research. We examine user cognition and analysis patterns, their implications for system requirements, and biological knowledge representations for causal analysis.

NCIBI STUDENTS

C1) Mining Significant Gene to Metabolite Correlations in NCI-60 Data Set

Gang Su, Chris Beecher, Manhong Dai, Brian Athey and Fan Meng

Rapid development of Metabolomics profiling has generated high quality data which capture various downstream biological processes and fluctuations of cell metabolism. The study of Metabolomics has not only opened new frontiers in molecular interaction and pathway analysis, but also promises enormous potential for revealing novel biological knowledge when integrated with other ‘omics’ data, such as Proteomics and Transcriptomics. The NCI-60 cell line data set commissioned by NCI includes microarray, metabolome, proteomics, methylation, miRNA and genotype data for most of the NCI-60 cell lines (http://dtp.nci.nih.gov/), and it provides excellent raw material for the exploration of interplay among genome, transcriptome, metabolome and epigenetics. A preliminary analysis of NCI-60 Metabolomics data with NCI-60 microarray data demonstrated that 1) although Metabolomics data are incomplete and contain a much higher noise level than microarray data, the metabolite cancer class classifiers can achieve performance comparable to microarray classifiers after appropriate sample filtering and feature selection and 2) by applying robust correlation computation methods, we managed to circumvent the issues of high noise level and extreme outliers in Metabolomics data to obtain significant correlations between enzyme gene expression and metabolite profiles.
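
As an illustration of the robust-correlation step, the sketch below computes a Spearman rank correlation between one gene's expression values and one metabolite's abundances across cell lines. The abstract does not name the method actually used; Spearman correlation is an assumption here, chosen because working on ranks limits the influence of extreme outliers. All function names and data values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: the last cell line is an extreme outlier.
expression = [1.0, 2.0, 3.0, 4.0, 1000.0]
metabolite = [2.0, 4.0, 6.0, 8.0, 9.0]
rho = spearman(expression, metabolite)
```

In this toy input the relationship is monotonic, so the rank-based coefficient is unaffected by the outlier, whereas the plain Pearson coefficient on the raw values is pulled below 1.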

C2) Missing Protein Function Prediction for Graph Data

Anna Shaverdian and H.V. Jagadish

Determining protein function is one of the most important problems in proteomics. Recent high-throughput experiments have determined proteome-scale protein-protein interaction networks. High-throughput data collection is messy and often has missing data. Therefore, many proteomes are missing functional annotations of their proteins. We develop a pattern-matching based algorithm to predict values for these missing attributes and show that our technique does much better in precision, recall, and stability metrics than traditional techniques based on interpolation or averaging.

C3) GIN-IE: Interaction Extraction from the Literature

Arzucan Ozgur, Dragomir R. Radev, and Alex Ade

Beyond the fact that a relationship exists between a pair of molecules, context information such as the type and directionality of the relationship is also important. To extract the relationships and their context information we use the sentences and their syntactic and dependency parse tree structures, which enable us to make syntax-aware inferences about the roles of the entities in a sentence. We also attempt to handle speculation, which is a frequent language phenomenon in biomedical scientific articles. While speculative information might still be useful for biomedical scientists, it is important that it is distinguished from the factual information. We investigate both machine learning-based approaches and rule-based approaches. While machine learning-based approaches achieve more balanced precision-recall performance, rule-based methods achieve higher precision at the expense of recall. High precision is an important requirement for most real-life applications. Therefore, we integrated our high-precision dependency rule-based approach with a pipeline for processing the PubMed updates on a daily basis. The extracted interactions are published as an RSS feed and included in the NCIBI data repository.

C4) GIN-NA: Gene Interaction Network Analysis

Arzucan Ozgur and Dragomir R. Radev

GIN-NA is a system for analyzing molecule interaction networks. The interaction networks are retrieved from the MiMI database, which integrates protein interactions from diverse biological data sources. Analyses of two types of networks are performed, namely molecule-specific networks and disease-specific networks. A molecule-specific network is the network of interactions in the neighborhood of a molecule. Besides the general network statistics such as average degree, power-law degree distribution, clustering coefficient, and shortest path statistics, GIN-NA ranks the molecules in the network based on graph centrality measures and second neighbor statistics. A disease-specific network is built by compiling a list of known disease genes and retrieving the interactions among these genes and their neighbors. We rank the genes based on their centrality in the network and hypothesize that genes that are central in the disease-specific gene interaction network are likely to be related to the disease. Currently, GIN-NA provides disease-specific networks for the following Driving Biological Problems: Prostate Cancer, Type 1 and Type 2 Diabetes, and Bipolar Disorder.
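
The centrality-based ranking described above can be sketched as follows. This is a minimal illustration that uses degree centrality only; the abstract does not specify which centrality measures GIN-NA computes, and the node labels and function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) interaction pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node interacts with."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def rank_by_centrality(graph):
    """Order nodes from most to least central, breaking ties by name."""
    scores = degree_centrality(graph)
    return sorted(scores, key=lambda node: (-scores[node], node))


# Hypothetical interaction list; real input would come from MiMI.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
network = build_graph(edges)
ranking = rank_by_centrality(network)
```

Other measures, such as closeness or betweenness, would slot into the same ranking step by replacing `degree_centrality`.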

C5) Ontology Integration: From Annotations to Translations

Sirarat Sarntivijai, Yongqun He, Matthias Kretzler and Brian D. Athey

Biomedical ontology (bio-ontology) was first created out of the need for systematic annotation. Most bio-ontologies residing in the OBO Foundry today were created de facto at the laboratory of origin. Therefore, computing with logical reasoning embedded in an individual bio-ontology can be challenging due to the divergence of individualism, especially when mapping multiple bio-ontologies for knowledge discovery. While such reusability and interoperability for knowledge transfer and discovery should be promoted, working with multiple bio-ontologies requires a sophisticated operating model that can overcome the issues of structural definition discrepancy of ontologies describing similar elements in the same domain, inconsistent and error-prone information within an ontology, and bridging across different information layers. We demonstrate that by mapping and integrating bio-ontologies of different biological layers from molecular genotype to molecular phenotype to clinical phenotype, bio-ontology processing plays an important role in knowledge discovery. Examples of use cases given in this study are integration of vaccine ontology in health care research and using ontology integration to identify key disease factors of Diabetic Nephropathy. The framework proposed here utilizes graph matching theory, natural language processing, and ontology alignment to create a novel approach to ontology integration that drives ontology processing forward from annotations to computations to translations for next-generation translational informatics.

C6) Mechanistic Bayesian Networks and PEBL

Abhik Shah and Peter Woolf

Here we present Mechanistic Bayesian Networks as an improved Bayesian technique for integrating knowledge and heterogeneous data to model complex biological phenomena. We include a case study using gene expression and quantitative miRNA expression data to elucidate the significant interactions underlying the Epithelial-Mesenchymal transition common in development and cancer metastasis. We also highlight an open-source, published software system useful for this and other Bayesian analyses.

C7) Tracing Data Provenance of MiMI

Jing Zhang and H.V. Jagadish

When a scientist sees surprising (or interesting) data in MiMI, they need to know that they can trust this data. To support this need, MiMI provides data provenance information. Data provenance is important for an integrated database because it can tell where a tuple in the integrated database comes from and also why it was selected from the source data. This information helps the users to evaluate the data in the integrated database and also helps the developers to debug the integration workflow and correct the data both in the integrated database and the source databases.

In the MiMI database, the integration is done in two steps: merging and loading. The first step merges external source files in different formats into uniformly formatted intermediate XML files; the second step loads the intermediate files into MiMI tables. Thus, one way to trace provenance of MiMI data is to first find out the contributing data items in the intermediate XML files for a given MiMI tuple, and then find out the contributing data items in the source files for the ones in the intermediate files. The first tracing step is done by executing tracing queries that are generated based on the loading script. The second tracing step can be done by providing different tracing codes for different source formats.
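
The two-step tracing can be pictured as two chained lookups, walking the integration workflow backwards. The sketch below is only a schematic: it assumes the results of the tracing queries and per-format tracing code are already available as plain dictionaries, and every identifier in it is hypothetical rather than taken from MiMI.

```python
def trace_provenance(tuple_id, load_map, merge_map):
    """Return the source records that contributed to one integrated tuple.

    load_map:  integrated tuple id -> items in the intermediate XML files
               (stands in for the tracing queries derived from the loading script)
    merge_map: intermediate item   -> records in the external source files
               (stands in for the per-format tracing code)
    """
    sources = []
    for item in load_map.get(tuple_id, []):
        for record in merge_map.get(item, []):
            if record not in sources:  # keep each source record once
                sources.append(record)
    return sources


# Hypothetical identifiers for illustration only.
load_map = {"tuple-1": ["xml-item-7", "xml-item-9"]}
merge_map = {
    "xml-item-7": ["sourceA:row12"],
    "xml-item-9": ["sourceA:row12", "sourceB:entry3"],
}
origin = trace_provenance("tuple-1", load_map, merge_map)
```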

NCIBI COLLABORATORS

D1) An Overview of the Data Analytics for Medicine Using Semi-Supervised Learning (DAMSEL) Program

Barbara Beckerman, Robert Patton, Christopher Symons, April McMillan, Shaun Gleason, Ryan Kerekes, Vincent Paquit,
and Robert Nishikawa

Presently, knowledge discovery and cohesive decision-making capabilities for biomedical applications are hampered by significant gaps in technology for multi-modal data analytics. Historically, multi-modal data analysis and classification systems are developed and then tuned to answer a specific question for a specific application. This tuning results in a "stovepiped" solution that, in general, is not applicable to other domains. We are addressing this issue by developing a novel learning framework that can intelligently combine important data-rich resources and technologies, which in turn will leapfrog current analytical capabilities in a more comprehensive, flexible, and responsive computational environment. We call this framework Data Analytics for Medicine using Semi-supervised Learning (DAMSEL). When completed, DAMSEL will include two innovations: (1) A unique ability to combine data of disparate modalities that does not fundamentally rely on ensemble learning (which ignores individual associations among modalities) or the combining of features from different modalities (which can severely degrade performance if many of the features are redundant or irrelevant); and (2) A unique ability to utilize multi-modal data in an intuitive learning process that can result in significant improvements even when the result is applied to the analysis of data of a single modality (e.g., train using radiological images and associated text reports in order to build more effective image classifiers for use on images that have not been analyzed by a radiologist). DAMSEL is being developed using two disparate biomedical applications: mammography and traumatic brain injury.

The overall objectives for DAMSEL are to: (1) Develop an analytical, automated learning framework and tools for processing multi-modality medical data (text and images) for the purpose of data mining and assessment; (2) Improve performance, portability, and scalability of this computational framework by leveraging available intelligent software and hardware computing resources and adding functionality to the system; and (3) Validate the performance of the system on medical data in terms of both knowledge accuracy and overall system responsiveness and usefulness.

The purpose of this presentation is to provide an overview of the system, progress to date, and a plan forward.

D2) Natural Language Query in the Biomedical Domain Based on Cognition Search™

Saurabh Mendiratta, Radha Akella, Kathleen Dahlgre, and Elizabeth J. Goldsmith

With the increasing volume of scientific papers and heterogeneous nomenclature in the biomedical literature, it is apparent that an improvement over standard pattern matching available in existing search engines is required. Cognition Search Information Retrieval (CSIR) is a natural language processing (NLP) technology that possesses a large dictionary (lexicon) and large semantic databases, such that search can be based on meaning. Encoded synonymy, ontological relationships, phrases, and seeds for word sense disambiguation offer significant improvement over pattern matching. Thus, the CSIR has the right architecture to form the basis for a scientific search engine.

The architecture and databases of the software are such that multiple meanings of ordinary words and synonymy are resolved. CSIR™ NLP technology contains a broad semantic map of English based on word senses, their synonyms, hypernyms (higher nodes in an ontology) and sense contexts. The CSIR Indexer uses its NLP component to build a cognitive model of the text in which all of the concepts (word meanings) of a document are indexed as well as word strings. The NLP component relies on its dictionary, semantic map, and morphological and syntactic tags (fig.1). At search time, CSIR interprets the query meaning, and searches for this meaning in its concept index rather than using statistical word pattern matching. Therefore, the results are more complete and relevant. The Cognition Search engine uses downward reasoning synonymy and word morphology to improve recall. The software also uses word sense selection and phrasal parsing which improve precision. It is best used by asking a simple question that might be answered in MEDLINE textual data, such as "genetic correlates of alcoholism," "Oxidative stress in plants," "spectroscopy of amidohydrolases," or "Depression in aging."

Here we have carried out several projects to "teach" the CSIR lexicon biomedical language from curated web-based sources. Biomedical language possesses ontological relationships in the domain of proteins, genes, the Tree-of-Life and diseases. We selected those websites that encoded ontological relationships and vocabulary (terms, phrases and acronyms), along with their synonyms. Websites used in these projects include The Alliance for Cell Signaling (AfCS) and the database from the website http://medstract.med.tufts.edu, The HUGO Gene Nomenclature Committee (HGNC), The Unified Medical Language System (UMLS) Metathesaurus, and The International Union of Pure and Applied Chemistry (IUPAC). We also introduced 60,000 IUPAC enzyme names and EC numbers. These were chosen because of the well-thought-out ontology that may be accessed with the EC numbers. We defined a top ontology (constructed by hand) for the biomedical domain that serves as a basis for capturing finer, more desired ontological nodes. The numbers of words or tokens present in MEDLINE but missing in the Cognition dictionary were counted. Unknown words with frequency greater than 100 were curated; there were only 800 of these. The remainder gave the frequency distribution shown in Fig. 2. As can be seen, capturing the words with frequency greater than 20 is desirable. At this writing, we have introduced most words with frequency greater than 50 (Fig. 2).

Together with other ongoing lexical augmentations, the entire Cognition semantic map currently has 506,000 word stems, 536,000 senses, 75,000 synonym classes, 17,000 ambiguous word definitions and 7,564 ontological nodes in all language domains.

The resulting system was used to interpret MEDLINE abstracts. Meaning-based searches of MEDLINE abstracts yield high precision and high recall (estimated at >90%), where synonym information has been encoded. Fifty queries for MEDLINE were formulated in the areas of biochemistry, molecular biology and medicine. We compared Cognition’s retrievals with those of PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez/). We used the "relative recall" technique, in which full recall is estimated as the greatest number of retrievals achieved by either search engine. The queries used can be seen on the Goldsmith Lab webpage http://hhmi.swmed.edu/Labs/bg/Cognition.
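
The "relative recall" estimate can be written out in a few lines. The sketch below assumes the inputs are the counts of relevant retrievals each engine returned for one query; the function name and the example counts are hypothetical and are not results from this study.

```python
def relative_recall(hits_a, hits_b):
    """Relative recall of two engines for a single query.

    Full recall is estimated as the larger of the two retrieval counts,
    so the better engine scores 1.0 by construction.
    """
    full = max(hits_a, hits_b)
    if full == 0:
        return 0.0, 0.0
    return hits_a / full, hits_b / full


# Hypothetical counts for one query.
recall_a, recall_b = relative_recall(18, 12)
```

Because the denominator is the best observed count rather than the true number of relevant documents, the measure compares engines against each other and does not establish absolute recall.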

This effort is our first pass at introducing Biochemical and Molecular Biology terms into the CSIR lexicon. Other sources of new words will come from tracking user queries, evaluation of MEDLINE, and other curated databases. Efforts directed toward database integration may provide useful definitions, synonymy and ontology in the biomedical domain. CSIR works equally well on full text and on abstracts. This work contributes to precise interpretation of biomedical texts for research and data mining. The present implementation can be found at http://MEDLINE.cognition.com.

D3) GPLBrowse: Infrastructure for Interactive Browsing of Microarray Data

Kay A. Robbins and Cory Burkhardt

This project explores methods of accessing microarray and other types of biological data in more intuitive and interlinked ways. For example, NCBI GEO allows researchers to access microarray data by entering series or sample accession numbers, but the GEO keyword search does not always isolate samples of interest. The goal of this project is to develop web-based methods of browsing microarray data based on data properties, rather than using such local identifiers. GPLBrowse displays statistical properties of samples (such as mean versus standard deviation) and allows users to select samples based on these properties. Users can also view samples in context. For example, a user might see how the samples of a particular series relate to those of the platform as a whole, or whether cancer samples have particular characteristics. Having selected a set of samples, the user can download the actual sample data in comma-separated value format. Users can enter keywords, and GPLBrowse immediately highlights the data matching the search criteria. Users can add these points to the current selection by pressing the +Select button to the right of the search box. GPLBrowse bases its search on all available content words in the sample and series metadata, not on a predefined list of keywords. In Zoom mode, users drag a box to designate the zoomed area, and GPLBrowse rescales the display. We implemented GPLBrowse using a split client-server architecture based on AJAX (asynchronous JavaScript and XML) technology to achieve user interactivity close to that of desktop applications. The GPLBrowse prototype currently supports 17 of the most popular microarray platforms, providing a convenient way for data miners to assemble large collections of microarray data for further study. This work was supported by NIH Research Centers in Minority Institutions 2G12RR1364-06A1.
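
Property-based selection of this kind can be sketched as computing a (mean, standard deviation) point per sample and keeping the samples whose points fall inside a chosen rectangle. This is an assumption-laden illustration and not the GPLBrowse implementation; the sample identifiers, values and function names are hypothetical.

```python
from statistics import mean, pstdev


def sample_properties(samples):
    """Map each sample id to its (mean, population standard deviation)."""
    return {sid: (mean(values), pstdev(values)) for sid, values in samples.items()}


def select_in_box(props, mean_range, std_range):
    """Return ids of samples whose (mean, std) point lies inside the box."""
    lo_m, hi_m = mean_range
    lo_s, hi_s = std_range
    return sorted(
        sid
        for sid, (m, s) in props.items()
        if lo_m <= m <= hi_m and lo_s <= s <= hi_s
    )


# Hypothetical expression values for three samples.
samples = {
    "S1": [1, 1, 1, 1],
    "S2": [0, 10, 0, 10],
    "S3": [4, 6, 4, 6],
}
props = sample_properties(samples)
selected = select_in_box(props, mean_range=(4, 6), std_range=(0, 2))
```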

D4) Identification of Metabolomic Lipid Markers of Prostate Cancer

Youping Deng, Puneet Bandi, Ruth Welti, and Venkata Thodima

Lipids have numerous critical biological functions which include membrane structure, energy storage, and signal transduction. Lipids have also been implicated in several human diseases, including cancer. Prostate cancer, the leading cancer among American men, has been repeatedly found to be related to lipids and lipid metabolism. We have measured 343 lipid species belonging to 12 lipid classes across 1 normal, 10 benign prostatic hyperplasia (BPH) and 10 prostate cancer tissues using lipid profiling. We have found clear differences in lipid profiles between prostate cancer and non-tumor tissues. There are a total of 49 lipid species (p value<0.05) that are differentially distributed between these two types of tissues. We found that PI lipid species showed the most significantly elevated levels in prostate cancer, while SM class species tended to decrease in prostate cancer. We also identified obvious lipid differences among various pathological features of prostate cancer, such as Gleason Score (GS). Moreover, we found that lipid profiles can be used to distinguish cancer tissues from normal tissues and correlate well with various pathological statuses such as GS. Our data have demonstrated that lipidomics is a trustworthy and promising technology for finding novel markers for prostate cancer and shown that there is a clear lipid profile difference between prostate non-tumor and cancer tissues and among various prostate pathological features.

D5) Is It Possible to Map Mutations to their Reference Sequences? An Examination of the OMIM Database

Zuofeng Li, Xingnan Liu, Xiaoyan Zhang, and Hong Yu

Mutation data (i.e., substitution, insertion and deletion) is an important part of biomedical knowledge. However, we have found the current task of manually identifying variants relating to a particular phenotype or gene in mutation databases to be tedious and, in many cases, prone to error. In this study, we explore approaches for automatically retrieving the mutation sequences for those mutations deposited in the OMIM database. The results indicate a probability of about 40.0% to 46.9% that biologists will get unmapped or conflicting results when using the OMIM database to extract sequences, mainly because of inconsistency in the reference sequences used by authors in the original literature. We propose several approaches to finding sequences.

D6) Automated Selection of Genes for Translational Research on Comorbidity of Bipolar Disorder with Substance Abuse

Raphael D. Isokpehi, Sharon A. Lewis, Tolulola O. Oyeleye, Wellington K. Ayensu, and Tonya M. Gerald

Bipolar Disorder is a highly heritable mental illness. The global burden of bipolar disorder is complicated by its comorbidity with substance abuse. Several genome-wide linkage/association studies on Bipolar Disorder as well as Substance Abuse have focused on the identification and/or prioritization of candidate disease genes. A useful step for translational research of these identified/prioritized genes is to identify sets of genes that have particular kinds of publicly available data. Therefore, we have leveraged the availability of links to related resources in the Entrez Gene database to develop a web-based resource for selecting genes based on presence or absence in particular biological data resources. The utility of our approach is demonstrated using a set of 3,399 genes from multiple eukaryotes that have been studied in the context of Bipolar Disorder and/or Substance Abuse. A web resource to automate the selection of genes that contain certain database links is available at http://compbio.jsums.edu/bpd. A future development goal of the resource is to facilitate systems biology analyses by linking query outputs to the Michigan Molecular Interactions (MiMI) resource.
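
Selecting genes by the presence or absence of links to particular data resources amounts to a set filter. The sketch below is a minimal illustration under the assumption that each gene's links are available as a set of resource names; it is not the code behind the web resource, and all gene and resource names are hypothetical.

```python
def select_genes(gene_links, required=(), excluded=()):
    """Return genes linked to every required resource and to no excluded one."""
    required, excluded = set(required), set(excluded)
    return sorted(
        gene
        for gene, links in gene_links.items()
        if required <= set(links) and not (excluded & set(links))
    )


# Hypothetical genes and linked resources.
gene_links = {
    "gene1": {"resource_a", "resource_b"},
    "gene2": {"resource_a"},
    "gene3": {"resource_b", "resource_c"},
}
only_a = select_genes(gene_links, required=["resource_a"], excluded=["resource_b"])
with_b = select_genes(gene_links, required=["resource_b"])
```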

D7) Utilization of NCIBI Tools MiMI and Cytoscape to Determine an Indirect Interaction Between Estrogen and Dopamine Receptor Sub-types in the Brain

Krystal Dempsey, Damien Barnette, DeLauren McCauley, Breanna Fonville, Victoria Miller, Brooke Hudson, and
Tonya M. Gerald

Bipolar disorder (BPD) is a highly heritable, severe and chronic mental illness characterized by episodes of elation and high activity alternating with periods of low mood and low energy. These symptoms are often complicated by co-morbid narcotics and alcohol abuse. Co-morbidity of BPD and substance abuse involves activation of common neurological pathways. The literature suggests that estrogen and dopamine receptor sub-types individually participate in the regulation of the neurological pathways common to BPD and drugs of abuse. In the brain, estrogen has been shown to modulate mood, while abnormal dopamine receptor signaling is implicated in the response to substance abuse. Therefore, determination of any direct or indirect linkages between estrogen and dopamine receptor activation is necessary for further understanding the genetics behind BPD and substance abuse co-morbidity. In this study, we utilize the National Center for Integrative Biomedical Informatics (NCIBI) tools, Michigan Molecular Interactions (MiMI) and the Cytoscape MiMI Plugin (Cytoscape), to demonstrate an indirect link between estrogen and dopamine receptor sub-types. MiMI allows users to explore all accessible databases of protein interactions, pathways and genes, while Cytoscape is an interactive visualization tool used for analyzing protein interactions and their biological effects. Funding Support: This work was supported in part by National Center for Integrative Biomedical Informatics (NIH U54DA021519), The Duke/NCCU STEM Partnership, HRD0411529, NCCU College of Science and Technology SEED Grant and NCCU Departments of Chemistry and Biology.

D8) Analysis of the Molecular Role of ANK3 in Bipolar Disorder

Sharon A. Lewis

The protein ankyrin 3 is encoded by the gene ANK3 located on chromosome 10q21. Ankyrins are peripheral membrane proteins thought to interconnect integral proteins with the spectrin-based membrane skeleton. Ankyrin 3 is a gene product immunologically distinct from ankyrins ANK1 and ANK2. It was originally found at the axonal initial segment and nodes of Ranvier of neurons, and it has been shown to regulate the assembly of voltage-gated sodium channels in the central and peripheral nervous systems. According to a genome-wide association study, mutations in the ANK3 gene may be involved in bipolar disorder, which is a chemical imbalance of neurotransmitters in the brain. This mental condition causes dramatic mood swings characterized by episodes of elation and high activity alternating with periods of low mood and low energy. Here we use bioinformatics to study mutations in this disorder. Our aim is to investigate ANK3 to identify the single nucleotide polymorphisms involved.

D9) Bipolar Disorder Sentence Database and LexRank Similarity Assessment of Sentences on Lithium Treatment

Matthew N. Anyanwu, Tolulola O. Oyeleye, Wellington K. Ayensu, Mehdi Pirooznia, and Raphael D. Isokpehi

The extraction of facts and information from unstructured natural language text such as PubMed abstracts is increasingly recognized as a crucial step for translational biomedical and behavioral research. Bipolar disorder (BPD) is a highly heritable, severe and chronic mental illness characterized by episodes of elation and high activity alternating with periods of low mood and low energy. This condition is less prevalent but more persistent and more impairing than major depressive disorder (MDD). A PubMed search for the text "Bipolar Disorder" using the MiSearch Adaptive PubMed Tool returned over 23,000 citations. We assume that these citations contain descriptors relevant to uncovering novel insights into various aspects of bipolar disorder. We have developed a software tool that splits the Title and Abstract text in PubMed XML files into sentences. We ran this software on a set of PubMed citations annotated with the MeSH term "Bipolar Disorder". A total of 125,988 sentences were obtained and can be queried using keywords and PubMed identifiers at http://compbio.jsums.edu/bpdsd. The database was developed to support a study on the Genetic Predisposition of African-American Women to Bipolar Disorder and Substance Abuse. Thus, we designed use cases to identify sentences and then abstracts that could guide further studies. Searches with the keywords female, genetic, woman, and women retrieved 1404, 2730, 267 and 1808 sentences, respectively. These subsets of sentences were further analyzed for co-occurrence of descriptors such as alcohol, comorbid and substance. We used LexRank (a National Center for Integrative Biomedical Informatics [NCIBI] tool) to determine the similarity of sentences from published bipolar disorder case reports on women that included lithium treatment. Filtering the LexRank graph with the Cosine and Salience measures yielded five significantly similar sentences, including those from abstracts on case reports of side effects of long-term lithium treatment in women (PMID:15730030; PMID:11050737). Further work will include improving the sentence-splitting tool that we have developed and implementing the LexRank algorithm to summarize results from queries.
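As an illustration of the sentence-splitting and similarity steps described in this abstract, the following Python sketch splits text into sentences with a naive rule, links sentences whose idf-weighted cosine similarity reaches a threshold, and scores each sentence by a damped random walk over that graph, in the spirit of LexRank. It is not the NCIBI LexRank tool or the authors' splitter; the function names, the regular expression, the threshold of 0.1 and the damping factor of 0.85 are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by
    whitespace and an uppercase letter or digit (an assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def idf_weights(token_lists):
    """Smoothed inverse document frequency, one 'document' per sentence."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def cosine(tokens_a, tokens_b, idf):
    """Cosine similarity of two sentences using tf * idf term weights."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] * idf[t] ** 2 for t in tf_a if t in tf_b)
    norm_a = math.sqrt(sum((tf_a[t] * idf[t]) ** 2 for t in tf_a))
    norm_b = math.sqrt(sum((tf_b[t] * idf[t]) ** 2 for t in tf_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one salience score per sentence (scores sum to 1)."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    idf = idf_weights(tokens)
    # Unweighted graph: an edge wherever cosine similarity >= threshold.
    adj = [
        [1.0 if i == j or cosine(tokens[i], tokens[j], idf) >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adj]  # always >= 1 because of the self-loop
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] / degree[j] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores
```

In this sketch a sentence scores highly when it is similar to many other sentences in the set, so ranking by score and keeping only edges above the cosine threshold gives a small group of mutually similar sentences.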

D10) Extending the Investigation of GRIN2B, a Prioritized Gene for Predisposition to Bipolar Disorder, using NCIBI Tools

Wellington K. Ayensu and Raphael D. Isokpehi

We previously observed, in a microarray study of gene expression changes in the HepG2 cell line exposed to low levels of mercury, that the Affymetrix probe set 213764_s_at, which maps to chromosome 12p13.1-p12.3, was upregulated. The protein-encoding gene Glutamate Receptor, Ionotropic, N-methyl D-aspartate (NMDA) 2B (GRIN2B) is also located on human chromosomal region 12p and has been prioritized in population-based studies as a candidate gene for predisposition to bipolar disorder. GRIN2B encodes the NR2B subunit of the NMDA receptor. NMDA receptor activation leads to a calcium influx into post-synaptic cells, a signal thought to be crucial for the induction of NMDA receptor-dependent Long Term Potentiation (LTP) and Long Term Depression (LTD). Thus, over-expression of these receptors could account for the "excitotoxicity" of the manic phases typical of Type I Bipolar Disorder. The objectives of our study were to determine Medical Subject Heading (MeSH) qualifiers as well as protein interactions associated with GRIN2B. We used the NCIBI Gene2MeSH Tool (http://gene2mesh.ncibi.org) to determine MeSH terms significantly associated with GRIN2B in PubMed abstracts. The NCIBI NetBrowser Tool was used to visualize the interactions of proteins with GRIN2B available from the Michigan Molecular Interactions (MiMI) database. A total of 49 significant MeSH headings were found matching the human gene symbol "GRIN2B". The associated MeSH qualifiers were etiology, cytology, genetics, metabolism, pharmacology and physiology. Furthermore, the Gene2MeSH analysis revealed Alcoholism and Ethanol as significant MeSH headings. Forty-two protein interactions were stored in MiMI for GRIN2B, and we classified them based on the type of interaction information (bidirectional, in vitro and in vivo). In summary, we have used NCIBI tools to extend our investigation of GRIN2B for understanding genetic predisposition to comorbid bipolar disorder and substance abuse.
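One common way to decide whether a MeSH heading is "significantly associated" with a gene is an over-representation test on citation counts. The Python sketch below uses a one-sided hypergeometric test for that purpose. The abstract does not state which statistic Gene2MeSH applies, so the choice of test is an assumption, and the function names and all counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total citations considered
    K: citations annotated with the MeSH heading
    n: citations linked to the gene
    k: citations that are both gene-linked and carry the heading
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def rank_headings(counts, gene_citations, total_citations):
    """Sort MeSH headings by enrichment p-value, smallest first.

    counts maps heading -> (k, K) as defined in hypergeom_upper_tail.
    """
    scored = [
        (heading, hypergeom_upper_tail(k, K, gene_citations, total_citations))
        for heading, (k, K) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Made-up counts for illustration only; these are not Gene2MeSH results.
example_counts = {
    "Heading A": (8, 50),   # co-occurs far more often than chance predicts
    "Heading B": (1, 200),  # co-occurs less often than chance predicts
}
ranking = rank_headings(example_counts, gene_citations=20, total_citations=1000)
```

With many headings tested at once, the resulting p-values would normally be adjusted for multiple comparisons before any heading is called significant.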

D11) Functional Insights into Universal Stress Proteins of Pseudomonas

Demareo J. Webb, Wellington K. Ayensu, Hari H.P. Cohly, and Raphael D. Isokpehi

Pseudomonas species are Gram-negative bacteria that are ubiquitous in nature and are able to thrive in extreme environments. The type species of the genus Pseudomonas is Pseudomonas aeruginosa, an opportunistic pathogen that affects patients with cystic fibrosis, the most common lethal genetic disorder in the United States to be identified in childhood. We are interested in the contribution of genes encoding proteins with the universal stress protein domain to the survival of Pseudomonas species in environmental stress conditions such as the anaerobic airways found in cystic fibrosis. Proteins with the Universal Stress Protein domain (Pfam:PF00582) are known to provide cells with the ability to respond to environmental stresses such as nutrient starvation, drought, high salinity, extreme temperatures, and exposure to toxic chemicals. There are at least 15 completely sequenced Pseudomonas genomes, providing opportunities for comparative analysis of universal stress proteins. We have determined that four of the six predicted universal stress proteins of the soil-inhabiting Pseudomonas fluorescens Pf0-1 are clustered in the same genomic region. There was also evidence from the Subsystem Annotation of the National Microbial Pathogen Data Resource (NMPDR) that the genes for the stress proteins in close genomic proximity are functionally coupled with a transcriptional regulator and an enzyme. The transcriptional regulator is predicted to have a role in oxidative stress, while the enzyme is important for cell growth and development. Since the universal stress proteins of most Pseudomonas species have not been characterized for function, we sought to determine known protein interactions associated with the universal stress proteins (UspA, UspB, UspC, UspD, UspE, UspF, UspG) of Escherichia coli. We used the Michigan Molecular Interactions (MiMI) database to obtain protein interactions from a variety of sources. The only E. coli Usp with compiled protein interactions was UspG. The 16 protein interactions found included an interaction with N-acetylglucosaminyl transferase (murG), annotated as functioning in the peptidoglycan biosynthetic process. Future work will involve integrative analysis of the Pseudomonas Usps to uncover novel insights into their function in disease and extreme environmental conditions.
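The observation that several universal stress protein genes lie in the same genomic region can be checked with a simple coordinate-based grouping. The Python sketch below merges genes into one cluster whenever the gap to the previous gene is at most a chosen distance. The abstract does not describe how clustering was determined, so this procedure, the gene labels, the coordinates and the 5,000 bp cutoff are all assumptions for illustration.

```python
def cluster_by_proximity(genes, max_gap):
    """Group genes on one replicon into clusters of neighbours.

    genes: list of (name, start, end) tuples with start <= end.
    max_gap: largest distance in bp between the end of the current
             cluster and the start of the next gene that still joins
             the two.
    Returns a list of clusters, each a list of gene names in
    genomic order.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    clusters = []
    current = []
    last_end = None
    for name, start, end in ordered:
        if current and start - last_end <= max_gap:
            # Close enough to the running cluster: extend it.
            current.append(name)
            last_end = max(last_end, end)
        else:
            # Too far away (or first gene): start a new cluster.
            if current:
                clusters.append(current)
            current = [name]
            last_end = end
    if current:
        clusters.append(current)
    return clusters


# Hypothetical labels and coordinates, not taken from any real genome.
example_genes = [
    ("usp_1", 1000, 1500),
    ("usp_2", 1600, 2100),
    ("usp_3", 2300, 2800),
    ("usp_4", 3000, 3400),
    ("usp_5", 500000, 500600),
    ("usp_6", 900000, 900500),
]
clusters = cluster_by_proximity(example_genes, max_gap=5000)
```

With these made-up coordinates, four of the six genes fall into one cluster and the other two stand alone, which mirrors the kind of pattern the abstract reports.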

D12) Modeling Arenavirus Nucleocapsid and Z Protein Structures

Aristotle M. Mannan, Eric R. May, Roger S. Armen, Ranjan V. Mannige, and Charles L. Brooks III

The Arenaviridae family of viruses, responsible for neurological disease and hemorrhagic fever, is transmitted to humans via rodents. Over 20 different strains have been identified and phylogenetically classified since the first outbreak of the virus in 1933, and new strains continue to emerge. It is known that Arenaviridae are enveloped and spherical, meaning that they are made up of a single protein copied numerous times, and contain two segments of single-stranded RNA [1]. There is no structural information about the major nucleocapsid protein (NP) and other proteins associated with the virus capsid structure. This lack of structural data has hindered the identification of potential drug targets and the development of effective drugs. Currently there are no vaccines or FDA-approved drugs for arenavirus infections.

The zinc-finger-like protein (Z) is known to interact with NP to induce budding, the process of viral proliferation. In order to better understand how NPs interact in the context of the spherical virus shells as well as with Z, structures of these two proteins were predicted using homology modeling methods. Amino acid sequences for the Old World Lymphocytic Choriomeningitis Virus (LCMV) and the New World Tacaribe Virus, two strains commonly studied in experimental labs, were used for prediction of tertiary structure. Models for Z were constructed from templates obtained through PSI-BLAST and compared to the tertiary structure predictions from web servers. In addition, tertiary structures of all known virus capsids described in VIPERdb [2] were used as potential templates for homology modeling of NP. Having constructed models of NP and Z, we identified putative protein-protein interaction sites, which may represent better antiviral drug targets than the interaction of Z with human proteins.

[1] Neuman, B.W., B.D. Adair, J.W. Burns, R.A. Milligan, M.J. Buchmeier, M. Yeager, Complementarity in the Supramolecular Design of Arenaviruses and Retroviruses Revealed by Electron Cryomicroscopy and Image Analysis, J. Virol. 2005, Vol. 79, 3822-3830

[2] Shepherd, C.M., I.A. Borelli, G. Lander, P. Natarajan, V. Siddavanahalli, C. Bajaj, J.E. Johnson, C.L. Brooks, V.S. Reddy, VIPERdb: a relational database for structural virology, Nucleic Acids Res. 2006, Vol. 34, D386-D389

D13) Microarray-Based Analysis of Regulatory Networks of Methylation-Sensitive Genes in T Cells

Anura Hewagama, Dipak Patel, and Bruce C. Richardson

Eukaryotic gene expression requires not only transcription factor activation but also regional modification of chromatin structure into a transcriptionally permissive configuration through epigenetic mechanisms, including DNA methylation and histone modifications. The methylation of dC bases in CpG islands promotes a repressive chromatin structure inaccessible to transcription factors, suppressing gene expression. Hypomethylation of regulatory sequences correlates with active transcription. T cell DNA hypomethylation has been implicated in the pathogenesis of idiopathic and drug-induced human lupus. However, the genes and the regulatory pathways affected by DNA hypomethylation are largely unknown. We report here a microarray analysis of T cell gene expression after treatment with the DNA methyltransferase inhibitor 5-aza-2’-deoxycytidine (5-aza). PBMCs isolated from 3 men and 3 women were stimulated with PHA for 24 hr. T cells were isolated and cultured with or without 5-aza for 3 days, then restimulated or not for 6 hr with PMA + ionomycin. Microarray data analysis was performed using the Genomatix program (http://www.genomatix.de/). We identified 165 and 215 5-aza-responsive genes from unrestimulated and restimulated cells, respectively. Among these differentially expressed genes, 16 transcription factor genes were upregulated upon 5-aza treatment. These included several factors linked to T cell gene regulation: activating transcription factor 5 (ATF5), polymerase (RNA) II (DNA directed) polypeptide K (POLR2K), cellular repressor of E1A-stimulated genes 1 (CREG1), cAMP responsive element modulator (CREM) and interleukin enhancer binding factor 3 (ILF3). The functional significance of the differentially expressed genes was estimated using gene ontology (GO) analysis. Immune system process (GO:0002376), response to stimulus (GO:0050896), apoptosis (GO:0006915) and cell communication (GO:0007154) were among the most significantly overrepresented groups. Pathway connections via regulatory networks linked the inflammatory response pathway, CD40 signaling, apoptosis, TGF beta signaling and matrix metalloproteinases to the differentially expressed genes of restimulated T cells.
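To make the notion of a "5-aza-responsive gene" concrete, the Python sketch below flags genes whose mean expression differs between treated and untreated cultures by at least a chosen log2 fold change. The analysis in this abstract was performed with the Genomatix program, whose criteria are not given here; the fold-change rule, the cutoff of 1.0, the function names and the expression values are all assumptions for illustration.

```python
import math
from statistics import mean


def log2_fold_change(treated, control):
    """log2 of the ratio of mean treated to mean control intensity.

    Both inputs are sequences of positive, linear-scale intensities.
    """
    return math.log2(mean(treated) / mean(control))


def responsive_genes(expression, min_abs_log2fc=1.0):
    """Return {gene: log2 fold change} for genes passing the cutoff.

    expression maps gene -> (treated_values, control_values).
    Positive values indicate upregulation with treatment,
    negative values indicate downregulation.
    """
    result = {}
    for gene, (treated, control) in expression.items():
        lfc = log2_fold_change(treated, control)
        if abs(lfc) >= min_abs_log2fc:
            result[gene] = lfc
    return result


# Hypothetical intensities for three placeholder genes (not study data).
example_expression = {
    "GENE_A": ([400, 420, 380], [100, 100, 100]),  # about 4-fold up
    "GENE_B": ([100, 110, 90], [100, 100, 100]),   # unchanged
    "GENE_C": ([25, 25, 25], [100, 100, 100]),     # about 4-fold down
}
responsive = responsive_genes(example_expression)
```

A real analysis would also account for replicate variability, for example with a statistical test per gene and correction for multiple testing, rather than relying on fold change alone.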
