YEARS

-

AUTHORS

Calvin A Johnson

TITLE

Informatics, Machine Learning & Biomedical Data Science

ABSTRACT

The Informatics, Machine Learning, and Biomedical Data Science program, which operates within the High Performance Computing and Informatics Office (HPCIO), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build a critical mass in text and numerical analytics. The effort is envisioned to encompass a number of related disciplines in biomedical research, including semantic interoperability, computational linguistics, text and data mining, natural language processing, machine learning, longitudinal analysis, and visualization. The program is intended to foster advances in critical domains at NIH including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, big data analysis, and portfolio analysis. In 2015, collaborative efforts in support of these goals included the following:

- In collaboration with NIA, we are applying machine learning and visualization techniques to large biological datasets to discover novel patterns of functional gene or protein interactions related to aging. In this collaboration, we are developing a machine learning method that models the temporal nature of longitudinal clinical data to predict the progression of amyotrophic lateral sclerosis. Such a method may also work well for prediction on high-dimensional time-series genomic data.

- In collaboration with NIAID, HPCIO has released HTJoinSolver(R), a new application capable of analyzing V(D)J recombination in thousands of immunoglobulin gene sequences produced by high-throughput sequencing.

- HPCIO is working with NCI to develop methodologies for incorporating occupational risk factors into epidemiological models. Novel classifiers are being developed to classify free-text job descriptions into the 840 codes of the 2010 U.S. Standard Occupational Classification (SOC) System. Agreement between our classification system and expert coders is measured using SOC code agreement and exposure agreement after applying CANJEM, a job-exposure matrix of over 250 exposure agents developed by Jérôme Lavoué at the University of Montreal.

- In collaboration with the Membrane Transport Biophysics Section of NINDS, HPCIO is 1) developing a tool to accurately identify the boundaries of lysosomes in fluorescence microscopy and 2) using the fluorescence ratio to measure lysosomal pH within each organelle, to better understand lysosomal pH regulation.

- HPCIO is collaborating with NIAID to study immune cell infiltration in various tissue samples from patients with metabolic diseases. Using systems-based approaches, we examine gene expression and genotyping data to understand the roles and interactions of different immune cells in response to metabolic disease signals and their associations with intervention outcomes and other phenotypes.

- A freely available plasmid database that is interoperable with popular freeware is being developed for the NIDA Optogenetics and Transgenic Technology Core. The Plasmid Manager offers a versatile yet simple platform for scientists to store and analyze their plasmid data. Motivated by the need for a more comprehensive approach to archiving plasmid data, the database platform includes numerous components beyond the repository itself and serves as an informatics platform designed to enhance the efficiency and analytic capabilities of scientists.

- In collaboration with CSR, HPCIO is applying text analytics to provide CSR leadership with evidence-based decision support in evaluating the grant review process. A Web-based automated referral tool, called ART, is being developed to help PIs and SROs identify the most relevant study section(s) or special emphasis panel(s) based on the scientific content of an application. In addition, HPCIO is analyzing text from quick feedback surveys on peer review. This effort includes a pilot study to evaluate the feasibility of analyzing free text from peer reviewers on their perception of study section quality. If successful, the pilot results will be used as initial input for a full-scale implementation.

- The Human Salivary Protein wiki has been made available online on a community-based Web portal developed by HPCIO, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.

- In collaboration with the Office of Data Analysis Tools and Systems, NIH Office of the Director, HPCIO has been developing a standard database update pipeline for NIH Topic Maps, originally developed by Dr. Ned Talley of NINDS. We are evaluating whether this pipeline can be incorporated into a stable hosted instance.

- High-throughput next-generation sequencing (NGS) technology plays an important role in systematically identifying novel cancer driver mutations in genome-wide surveys, and NGS data generation is increasing rapidly: the Lymphoid Malignancies Section of NCI currently accumulates several terabytes of data every month. Database platforms need to be enhanced in anticipation of even more growth in the near future. The recent emergence of Hadoop/NoSQL systems (e.g., HBase) has provided an alternative platform for querying large-scale genomic data. In addition, relational database providers have been enhancing their offerings to include products for explicitly distributing data across multiple nodes (e.g., Postgres-XL). We have sought to integrate these technologies with current relational database systems (e.g., Postgres) to improve performance in a parallel or distributed manner. The goal of our effort has been to investigate the potential of these distributed platforms for storing and querying the large volumes of data that NCI accumulates, thereby augmenting NCI's current analytical capabilities.

- Based on its experience in building novel models for classifying research grants and projects, HPCIO has collaborated with DPCPSI/OD and other ICs to develop the Portfolio Learning Tool, a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes to build and run their own classifiers. HPCIO has developed an augmented support vector machine (SVM) that enlarges a training set by sampling from a corpus of unknowns and runs a large ensemble on various samples of this augmented space. The results obtained from this classifier suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at categorizing a research portfolio.
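
The abstract does not say how the SOC code agreement in the NCI item is computed. The following is a minimal sketch, assuming it is the share of job descriptions for which the classifier and the expert coder assign the same code, evaluated at several levels of the SOC hierarchy. The function names and the sample codes are illustrative assumptions, not project data; the digit boundaries follow the published 2010 SOC code layout ("XX-XXXX"), and minor groups are left out for brevity. Exposure agreement after applying CANJEM is not covered here.

```python
from typing import Dict, List

# Number of leading digits that identify each level of a 2010 SOC code
# written as "XX-XXXX" (assumption: minor groups are skipped in this sketch).
LEVEL_DIGITS: Dict[str, int] = {"major": 2, "broad": 5, "detailed": 6}


def truncate(code: str, level: str) -> str:
    """Return the leading digits of a SOC code that identify the given level."""
    digits = code.replace("-", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"not a six-digit SOC code: {code!r}")
    return digits[: LEVEL_DIGITS[level]]


def agreement(predicted: List[str], expert: List[str], level: str) -> float:
    """Fraction of records where both coders agree at the given SOC level."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("need two non-empty code lists of equal length")
    hits = sum(
        truncate(p, level) == truncate(e, level)
        for p, e in zip(predicted, expert)
    )
    return hits / len(predicted)


# Illustrative codes for four job descriptions (hypothetical, not project data).
classifier_codes = ["15-1131", "29-1141", "47-2061", "11-9013"]
expert_codes = ["15-1132", "29-1141", "47-2031", "13-1071"]

for level in LEVEL_DIGITS:
    print(level, agreement(classifier_codes, expert_codes, level))
```

Reporting agreement at more than one level separates near misses (same major group, different detailed occupation) from outright disagreements.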

FUNDED PUBLICATIONS

  • The genetic association database.
  • Disease and phenotype gene set analysis of disease-based gene expression in mouse and human.
  • Early origin and recent expansion of Plasmodium falciparum.
  • HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs.
  • Analysis of somatic hypermutation in X-linked hyper-IgM syndrome shows specific deficiencies in mutational targeting.
  • The Genetic Association Database
  • Reconstruction for time-domain in vivo EPR 3D multigradient oximetric imaging--a parallel processing perspective.
  • HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs
  • Delineation of a conserved arrestin-biased signaling repertoire in vivo.


    25 TRIPLES      15 PREDICATES      25 URIs      7 LITERALS

    Subject (all rows): grants:1f6d4ff50fb7836c1ff00478cc425aa3

    #   Predicate                    Object
    1   sg:abstract                  (abstract text, identical to the ABSTRACT section above)
    2   sg:fundingAmount             15263326.0
    3   sg:fundingCurrency           USD
    4   sg:hasContribution           contributions:f639d8a877eef38a9033622ded2a95a2
    5   sg:hasFieldOfResearchCode    anzsrc-for:08
    6                                anzsrc-for:0801
    7                                anzsrc-for:0806
    8   sg:hasFundedPublication      articles:121cd7ac2c9b4a6199b3854b3a55f0ed
    9                                articles:1773cb9cdf910a5c69733b47719bb831
    10                               articles:813a1c9b1ba2db9fc3e011921f332e57
    11                               articles:958c54ee92ff26575291befb0380ddc6
    12                               articles:bd57653178cf3b3f3e6354fab166cc36
    13                               articles:d8173cc56ea15eb5d3505d509e464ba7
    14                               articles:d968dfc70a618e23900bfc6504b077cb
    15                               articles:f86ef21195cc6e39364d94104f27195b
    16                               articles:fad9174fc624a5c59619f06c528496f3
    17  sg:hasFundingOrganization    grid-institutes:grid.410422.1
    18  sg:hasRecipientOrganization  grid-institutes:grid.410422.1
    19  sg:language                  English
    20  sg:license                   http://scigraph.springernature.com/explorer/license/
    21  sg:scigraphId                1f6d4ff50fb7836c1ff00478cc425aa3
    22  sg:title                     Informatics, Machine Learning & Biomedical Data Science
    23  sg:webpage                   http://projectreporter.nih.gov/project_info_description.cfm?aid=9146134
    24  rdf:type                     sg:Grant
    25  rdfs:label                   Grant: Informatics, Machine Learning & Biomedical Data Science

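
The triple table above states the subject once and leaves the predicate blank on rows that repeat the previous row's predicate. A minimal sketch of expanding such rows into full (subject, predicate, object) triples follows. `expand_rows` is a hypothetical helper, not part of any SciGraph tooling, and the row tuples are transcribed by hand from rows 2, 3, 5, 6, and 7 of the table, with the subject from row 1 carried onto the first tuple.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], str]  # (subject, predicate, object)
Triple = Tuple[str, str, str]


def expand_rows(rows: List[Row]) -> List[Triple]:
    """Fill blank subject/predicate cells from the nearest row above."""
    triples: List[Triple] = []
    subject: Optional[str] = None
    predicate: Optional[str] = None
    for row_subject, row_predicate, obj in rows:
        subject = row_subject or subject
        predicate = row_predicate or predicate
        if subject is None or predicate is None:
            raise ValueError("first row must state both subject and predicate")
        triples.append((subject, predicate, obj))
    return triples


GRANT = "grants:1f6d4ff50fb7836c1ff00478cc425aa3"
rows: List[Row] = [
    (GRANT, "sg:fundingAmount", "15263326.0"),
    (None, "sg:fundingCurrency", "USD"),
    (None, "sg:hasFieldOfResearchCode", "anzsrc-for:08"),
    (None, None, "anzsrc-for:0801"),  # blank predicate: inherits the row above
    (None, None, "anzsrc-for:0806"),
]

for triple in expand_rows(rows):
    print(" ".join(triple))
```
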
    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular JSON format for linked data.

    curl -H 'Accept: application/ld+json' 'http://scigraph.springernature.com/things/grants/1f6d4ff50fb7836c1ff00478cc425aa3'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'http://scigraph.springernature.com/things/grants/1f6d4ff50fb7836c1ff00478cc425aa3'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'http://scigraph.springernature.com/things/grants/1f6d4ff50fb7836c1ff00478cc425aa3'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'http://scigraph.springernature.com/things/grants/1f6d4ff50fb7836c1ff00478cc425aa3'
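
The four curl commands above differ only in the Accept header. A minimal Python sketch of the same content negotiation with the standard library follows. `build_request` and `fetch` are hypothetical helper names, the short format labels are this sketch's own, and the endpoint is copied from the curl commands; whether it still responds is not something the sketch can confirm.

```python
import urllib.request

GRANT_URL = (
    "http://scigraph.springernature.com/things/grants/"
    "1f6d4ff50fb7836c1ff00478cc425aa3"
)

# Media types copied from the curl commands; the keys are labels chosen here.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt: str, url: str = GRANT_URL) -> urllib.request.Request:
    """Create a GET request asking for the given serialization."""
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(fmt: str, url: str = GRANT_URL, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    request = build_request(fmt, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building a request does not touch the network; only fetch() does.
request = build_request("turtle")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Calling `fetch("turtle")` would then perform the request that the third curl command makes.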





