YEARS

-

AUTHORS

James D Malley

TITLE

Statistical Learning for Biomedical Data

ABSTRACT

This project studies statistical learning machines as applied to personalized biomedical and clinical probability estimates for outcomes, patient-specific risk estimation, synthetic features, noise detection, and feature selection. In more detail: 1. Probability machines can generate personalized probability predictions for multiple phenotypes and outcomes, such as tumor versus not tumor. These methods fully supersede simple classification methods, which generate only zero-or-one predictions. The distinction is this: a pure classification scheme will produce the same prediction for these two outcomes: an 85% chance of tumor and a 58% chance of tumor. These outcomes can be expected to have distinct and critical patient-level evaluations, prognoses, and treatment plans, specific to patient subgroups. A probability machine produces provably consistent probability estimates (85% or 58%) for each patient, and does so using any number or type of predictors, with no model specification required and arbitrary correlation structure in the features. Thus, a probability machine makes significantly better use of the available information in the data. If a specific classical analysis model, such as a logistic regression scheme, is assumed to be exactly correct for the data, then the probability machine will provide estimates that can fully support, question, or challenge the validity of the logistic regression. Moreover, no interaction terms need be specified by the researcher: the probability machine is provably consistent in the absence of any user-input interaction terms or so-called confounders. 2. Risk machines are based on multiple probability machines and counterfactual detection engines. They provide provably consistent estimates of all manner of risk effect estimates: log odds, risk ratios, and risk differences. Most critically, they provide patient-specific risk estimates. 
They are entirely model free, can use any number or type of predictors, and allow for arbitrary, unspecified correlation structure in the features. If a specific classical analysis model, such as a logistic regression scheme, is known to be correct for the data at hand, then the risk machine will provide estimates that can fully support, question, or challenge the validity of the logistic regression. That is, the risk machine can provide a fully model-free validation of a smaller parametric model, if correct, by generating risk effect sizes that agree with the logistic regression model parameters. As with any probability machine, no user-input interaction terms are required: the risk machines can, indeed, be used for interaction detection in the absence of any parametric model. 3. The introduction of synthetic features considerably expands the classical notion of features or predictors, by allowing the researcher to assemble new sets of features or networks and allowing a statistical learning machine to then process the data using both original and synthetic features. Typically, a small linear parametric model is invoked to remove the effects of confounders, such as age, gender, or population stratification. Unless the model is known to be exactly correct, this treatment of confounders is certain to be in error. The use of synthetic features is a fully nonparametric alternative approach to this problem. 4. Crowd machines can optimally combine the results of any number of learning machines in a model-free scheme. They can also relieve the researcher from having to optimally set any learning machine tuning parameters. The results of any learning machine analysis therefore become independent of any required tuning parameters, such as the kernel of a support vector machine or the architectural details of a neural net. 
The crowd machine combines detection from any number of machines, specifically allowing for one or another machine to be optimal for some subset of patients and/or some subset of features. The crowd machine is not a simple ensemble, committee, or voting scheme. It has been shown to be provably optimal as a statistical data analysis scheme, at least as good as the best machine in the collection. It does not require naming a winner among the collection of machines. Indeed, the search for such winners is easily shown to be suboptimal, for example when a machine is best for some portion of the data but not for other subsets of the data. 5. Probability machines can be used for feature selection using the new and validated notion of recurrency. No linear ranking of features is ever necessary; in fact, simple examples show that such linear rankings can be inconsistent and contradictory. Features that may be only weakly predictive can be reliably detected using the method of recurrency. That is, the data may have no main effects, no single features that are critical for estimating the personalized probability for an outcome or the patient-specific risk effect sizes. Yet multiple subsets of features, none strongly predictive, may jointly provide excellent probability and risk estimates. The method of recurrency locates these features in the data. 6. Similarly, the method of recurrency can be used to remove features that are clearly noise and that only obscure the truly predictive features in the data. 7. Probability and risk machines can jointly provide nonparametric detection of interacting features. Such detection--entanglement maps--can be undertaken in a fully model-free environment. Simple examples show that interactions among features are often not recovered using the pair-wise products of these features in any model. 
Entanglement mapping has immediate application to genome-wide interaction detection, even when no single genetic marker, any SNP say, is by itself a predictive feature.
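Points 1 and 2 above can be sketched concretely. The following is a minimal illustration, not the project's actual implementation: the simulated data, the logistic data-generating model, and the use of scikit-learn's RandomForestClassifier as the probability machine are all assumptions, and the counterfactual flip of a binary exposure column merely stands in for the "counterfactual detection engine" described in the abstract.

```python
# Sketch: a "probability machine" via random forest, and a patient-specific
# risk difference obtained by flipping a binary exposure feature.
# Illustrative only: the simulated data and scikit-learn are assumptions,
# not the grant's actual implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
exposure = rng.integers(0, 2, size=n)   # binary exposure of interest
covariate = rng.normal(size=n)          # an arbitrary additional predictor
# True outcome probability depends on both; the forest is never told this model.
p = 1 / (1 + np.exp(-(0.8 * exposure + 0.5 * covariate - 0.2)))
y = rng.binomial(1, p)

X = np.column_stack([exposure, covariate])
forest = RandomForestClassifier(n_estimators=500, min_samples_leaf=25,
                                random_state=0)
forest.fit(X, y)

# Point 1: a personalized probability estimate for each patient.
prob = forest.predict_proba(X)[:, 1]

# Point 2: predict under both counterfactual exposure settings and subtract,
# giving one risk-difference estimate per patient.
X1, X0 = X.copy(), X.copy()
X1[:, 0], X0[:, 0] = 1, 0
risk_diff = forest.predict_proba(X1)[:, 1] - forest.predict_proba(X0)[:, 1]
```

The same two forest passes, with ratios instead of differences, would yield patient-specific risk ratios or log odds.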
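Point 4, the crowd machine, is described above only at a high level; a generic stacking-style combination of out-of-fold predictions gives the flavor. The choice of base learners and the nonnegative-least-squares combiner below are illustrative assumptions, not the grant's algorithm.

```python
# Sketch of a crowd-machine-style combination: several learners produce
# out-of-fold probability predictions, which a simple nonnegative-weight
# combiner merges. Base learners and combiner are illustrative assumptions.
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
learners = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=25),
]
# Out-of-fold probabilities keep the combiner honest: no learner is scored
# on data it was fit to.
Z = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in learners
])
# Nonnegative weights fit to the observed outcomes, then normalized, so the
# combination is at least as good as any single learner on these folds.
w, _ = nnls(Z, y.astype(float))
w = w / w.sum()
crowd_prob = Z @ w
```

Because the weights are convex, the combined prediction remains a valid probability, and no single "winner" among the learners ever needs to be declared.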

FUNDED PUBLICATIONS

  • O brave new world that has such machines in it.
  • Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory.
  • SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks.
  • Practical experiences on the necessity of external validation.
  • Synthetic learning machines.
  • Multiple neural network classification scheme for detection of colonic polyps in CT colonography data sets.
  • Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects.
  • Comparative validation of the D. melanogaster modENCODE transcriptome annotation.
  • A system-level pathway-phenotype association analysis using synthetic feature random forest.
  • The clinical phenotypes of the juvenile idiopathic inflammatory myopathies.
  • Support vector machines committee classification method for computer-aided polyp detection in CT colonography.
  • Performance of random forests and logic regression methods using mini-exome sequence data.
  • The disconnect between classical biostatistics and the biological data mining community.
  • The effect of stocking densities on reproductive performance in laboratory zebrafish (Danio rerio).
  • Immunogenetic differences between Caucasian women with and those without silicone implants in whom myositis develops.
  • Predictor correlation impacts machine learning algorithms: implications for genomic studies.
  • Patient-centered yes/no prognosis using learning machines.
  • Risk estimation using probability machines.
  • The limits of p-values for biological data mining.
  • Computer-assisted detection of colonic polyps with CT colonography using neural networks and binary classification trees.
  • Short-term prediction of mortality in patients with systemic lupus erythematosus: classification of outcomes using random forests.
  • Looking for childhood-onset schizophrenia: diagnostic algorithms for classifying children and adolescents with psychosis.
  • HLA polymorphisms in African Americans with idiopathic inflammatory myopathy: allelic profiles distinguish patients with different clinical phenotypes and myositis autoantibodies.
  • Clinical and immunogenetic prognostic factors for radiographic severity in ankylosing spondylitis.
  • Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience.
  • Immunogenetic risk and protective factors for the idiopathic inflammatory myopathies: distinct HLA-A, -B, -Cw, -DRB1, and -DQA1 allelic profiles distinguish European American patients with different myositis autoantibodies.
  • Innovation is often unnerving: the door into summer.
  • First complex, then simple.
  • Evaluating interventions to improve gait in cerebral palsy: a meta-analysis of spatiotemporal measures.
  • Using multivariate machine learning methods and structural MRI to classify childhood onset schizophrenia and healthy controls.

RDF METADATA


    55 TRIPLES      15 PREDICATES      55 URIs      7 LITERALS

    Subject Predicate Object
    1 grants:1e931ce8c52270899ce00d0a3ce6f8c7 sg:abstract (full abstract text, duplicated above under ABSTRACT)
    2 sg:fundingAmount 795593.0
    3 sg:fundingCurrency USD
    4 sg:hasContribution contributions:f8ae1cfcdf13e10b26c3d3b181fcc355
    5 sg:hasFieldOfResearchCode anzsrc-for:01
    6 anzsrc-for:0104
    7 anzsrc-for:08
    8 anzsrc-for:0801
    9 sg:hasFundedPublication articles:0b473c7cf97db8f8e9ac1322ea3ac261
    10 articles:214498b0ddb4aa5f92be5eacc6299598
    11 articles:267f7b5f12984b207426ecaa8599b7e9
    12 articles:2767dbf951eb954fb35ddce1062550d6
    13 articles:2ff50f8cba97990462269af333c73863
    14 articles:361214d7c113b942cb401bab17581009
    15 articles:3694e4b424ddc7855f5a568004cf9f12
    16 articles:487ee72bcc30e27e3216248b22dcf44c
    17 articles:4919de48fc054d99cf8023b4c314db23
    18 articles:4976be8c9b3846da415323cbe3b335aa
    19 articles:49e4b551c2a22d0c859f85671ede60b3
    20 articles:4a75d50e20377b34a29c8b0b13062a1d
    21 articles:4c1224b869e442fd3670caee42935799
    22 articles:5ba91cc7c4ae24d4eab63404abe6756a
    23 articles:5c6c658d7ba2b3bb9e8a459199265a31
    24 articles:737ba490ae0f7652b09298a146b5afb4
    25 articles:74172120a44cc603e723f061055ee02b
    26 articles:7eadf74fcaae26e085dfa9f7b529e4bf
    27 articles:85b04a1b13fed9166b1bfbcc3898b73d
    28 articles:9988529a590dc0932ddffaa05ea65614
    29 articles:a2f15fa07a15b59c13321c295c11d824
    30 articles:a321109a45ff0d9ea53e2efdcf619786
    31 articles:ac084c171dd0a2d1b7b6385a09d692ba
    32 articles:b0f5474f0e061e40d53083646e033ecc
    33 articles:b4bdb0afc411543cafd3acc95bbfc3e6
    34 articles:b7f008c98c913cb318cd4a1fa85e6153
    35 articles:bb0730c0274f3e402022d76477d4ef1e
    36 articles:c0cf86527661c40fcfde393b1bf754c5
    37 articles:c51acf44fde6c65e51286b31e0dea319
    38 articles:c609836fc46ea51261f9d303e5f966da
    39 articles:c84c72e3b5d9bbc35b92cd57ca3dea72
    40 articles:e1208cf098c6c29b44598ce20564bc03
    41 articles:e5872abc3db76f7784c1a71af5c72d20
    42 articles:e79d10af771fbcbacbe408e1ba5d6aa5
    43 articles:ee0474055c54e1ed60661b8a20bd2239
    44 articles:ee65b59710fc31383917fa4fe9c04e9d
    45 articles:f5af54fa772b9b6c9b1564e5b3df9f52
    46 articles:ffa2293602951d0221da6e2b256fbe48
    47 sg:hasFundingOrganization grid-institutes:grid.410422.1
    48 sg:hasRecipientOrganization grid-institutes:grid.410422.1
    49 sg:language English
    50 sg:license http://scigraph.springernature.com/explorer/license/
    51 sg:scigraphId 1e931ce8c52270899ce00d0a3ce6f8c7
    52 sg:title Statistical Learning for Biomedical Data
    53 sg:webpage http://projectreporter.nih.gov/project_info_description.cfm?aid=9146127
    54 rdf:type sg:Grant
    55 rdfs:label Grant: Statistical Learning for Biomedical Data
    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular JSON format for linked data.

    curl -H 'Accept: application/ld+json' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'
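
    The same content negotiation can be done from Python's standard library; the endpoint may no longer be live, so only the request construction is shown here, with the actual fetch left commented out.

```python
# Build the same JSON-LD request as the first curl example, using only
# Python's standard library. The Accept header selects the serialization.
from urllib import request

url = ("http://scigraph.springernature.com/things/grants/"
       "1e931ce8c52270899ce00d0a3ce6f8c7")
req = request.Request(url, headers={"Accept": "application/ld+json"})
# Uncomment to fetch, if the endpoint is still live:
# with request.urlopen(req) as resp:
#     data = resp.read()
```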





