
TITLE: Statistical Learning for Biomedical Data
ABSTRACT
This project studies statistical learning machines as applied to personalized biomedical and clinical probability estimates for outcomes, patient-specific risk estimation, synthetic features, noise detection, and feature selection. In more detail:

1. Probability machines can generate personalized probability predictions for multiple phenotypes and outcomes, such as tumor vs. not tumor. These methods fully supersede simple classification methods, those that only generate zero-or-one predictions. The distinction is this: a pure classification scheme will produce the same prediction for these two outcomes: an 85% chance of tumor, and a 58% chance of tumor. These outcomes can be expected to have distinct and critical patient-level evaluations, prognoses, and treatment plans, specific to patient subgroups. A probability machine produces provably consistent probability estimates (85% or 58%) for each patient, and does so using any number or type of predictors, with no model specification required, and arbitrary correlation structure in the features. Thus, a probability machine makes significantly better use of the available information in the data. If a specific, classical analysis model such as a logistic regression scheme is assumed to be exactly correct for the data, then the probability machine will provide estimates that can fully support, question, or challenge the validity of the logistic regression. Moreover, no interaction terms need to be specified by the researcher: the probability machine is provably consistent in the absence of any user-input interaction terms or so-called confounders.

2. Risk machines are based on multiple probability machines and counterfactual detection engines. They provide provably consistent estimates of all manner of risk effect estimates: log odds, risk ratios, risk differences. Most critically, they provide patient-specific risk estimates. They are entirely model-free, can use any number or type of predictors, and allow for arbitrary, unspecified correlation structure in the features. If a specific, classical analysis model such as a logistic regression scheme is known to be correct for the data at hand, then the risk machine will provide estimates that can fully support, question, or challenge the validity of the logistic regression. That is, the risk machine can provide a fully model-free validation of a smaller parametric model, if correct, by generating risk effect sizes that agree with the logistic regression model parameters. As with any probability machine, no user-input interaction terms are required: the risk machines can, indeed, be used for interaction detection, in the absence of any parametric model.

3. The introduction of synthetic features considerably expands the classical notion of features or predictors, by allowing the researcher to assemble new sets of features or networks and allowing a statistical learning machine to then process the data using both original and synthetic features. Typically, a small linear parametric model is invoked to remove the effects of confounders, such as age, gender, or population stratification. Unless the model is known to be exactly correct, this treatment of confounders is certain to be in error. The use of synthetic features is a fully nonparametric alternative approach to this problem.

4. Crowd machines can optimally combine the results of any number of learning machines, in a model-free scheme. They can also relieve the researcher from having to optimally set any learning machine tuning parameters. The results of any learning machine analysis therefore become independent of any required tuning parameters, such as a support vector machine's kernel or the details of a neural net. The crowd machine combines detection from any number of machines, specifically allowing for one or another machine to be optimal for some subset of patients and/or some subset of features. The crowd machine is not a simple ensemble, committee, or voting scheme. It has been shown to be provably optimal as a statistical data analysis scheme, at least as good as the best machine in the collection. It does not require naming a winner among the collection of machines. Indeed, the search for such winners is easily shown to be suboptimal, for example when a machine is best for some portion of the data but not so for other subsets of the data.

5. Probability machines can be used for feature selection using the new and validated notion of recurrency. No linear ranking of features is ever necessary; in fact, simple examples show that such linear ranking can be inconsistent and contradictory. Features that may be only weakly predictive can be reliably detected using the method of recurrency. That is, the data may have no main effects, no single features that are critical for estimating the personalized probability for an outcome, or the patient-specific risk effect sizes. Yet multiple subsets of features, none strongly predictive, may jointly provide excellent probability and risk estimates. The method of recurrency locates these features in the data.

6. Similarly, the method of recurrency can be used to remove features that are clearly noise and that only obscure the truly predictive features in the data.

7. Probability and risk machines can jointly provide nonparametric detection of interacting features. Such detection (entanglement maps) can be undertaken in a fully model-free environment. Simple examples show that interactions among features are often not recovered using the pairwise products of these features in any model. Entanglement mapping has immediate application to genome-wide interaction detection, even when no single genetic marker, any SNP say, is by itself a predictive feature.
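To make the probability-machine and risk-machine ideas concrete, here is a minimal sketch. It uses a k-nearest-neighbour average of 0/1 outcomes as the regression-consistent learner (the published probability machines typically use random forests); the cohort, the features (age, a binary exposure), and all parameter values are synthetic and purely illustrative, not from the project itself.

```python
# Sketch of a probability machine and a counterfactual risk-difference
# estimate. Assumption: k-nearest-neighbour regression on 0/1 outcomes
# stands in for the regression-consistent learner; all data is synthetic.
import math
import random

random.seed(0)

# Synthetic cohort: age, a binary exposure, and a 0/1 outcome ("tumor").
n = 4000
data = []
for _ in range(n):
    age = random.uniform(30, 80)
    exposure = random.randint(0, 1)
    logit = 0.04 * (age - 55) + 0.8 * exposure
    p = 1 / (1 + math.exp(-logit))       # true P(tumor | age, exposure)
    y = 1 if random.random() < p else 0
    data.append((age, exposure, y))

def prob_machine(age, exposure, k=200):
    """Estimate P(tumor=1 | age, exposure) by averaging the 0/1 outcomes
    of the k nearest patients (age rescaled so both features matter)."""
    scored = sorted(data, key=lambda r: ((r[0] - age) / 25) ** 2
                                        + (r[1] - exposure) ** 2)
    return sum(r[2] for r in scored[:k]) / k

# Personalized probability for one patient, and the risk-machine step:
# flip the exposure counterfactually, holding everything else fixed.
p_exposed = prob_machine(62, 1)
p_unexposed = prob_machine(62, 0)
risk_difference = p_exposed - p_unexposed  # patient-specific risk difference
```

Because the learner regresses the 0/1 outcome directly, its prediction estimates a conditional probability rather than a hard class label; the counterfactual flip then yields a patient-specific risk difference without any parametric model.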
FUNDED PUBLICATIONS
Download the RDF metadata as: JSON-LD, N-Triples, Turtle, or RDF/XML.
55 TRIPLES 15 PREDICATES 55 URIs 7 LITERALS
Subject  Predicate  Object  

1  grants:1e931ce8c52270899ce00d0a3ce6f8c7  sg:abstract  This project studies statistical learning machines as applied to personalized biomedical and clinical probability estimates for outcomes, patient-specific risk estimation, synthetic features, noise detection, and feature selection. (full abstract as given above) 
2  ″  sg:fundingAmount  795593.0 
3  ″  sg:fundingCurrency  USD 
4  ″  sg:hasContribution  contributions:f8ae1cfcdf13e10b26c3d3b181fcc355 
5  ″  sg:hasFieldOfResearchCode  anzsrcfor:01 
6  ″  ″  anzsrcfor:0104 
7  ″  ″  anzsrcfor:08 
8  ″  ″  anzsrcfor:0801 
9  ″  sg:hasFundedPublication  articles:0b473c7cf97db8f8e9ac1322ea3ac261 
10  ″  ″  articles:214498b0ddb4aa5f92be5eacc6299598 
11  ″  ″  articles:267f7b5f12984b207426ecaa8599b7e9 
12  ″  ″  articles:2767dbf951eb954fb35ddce1062550d6 
13  ″  ″  articles:2ff50f8cba97990462269af333c73863 
14  ″  ″  articles:361214d7c113b942cb401bab17581009 
15  ″  ″  articles:3694e4b424ddc7855f5a568004cf9f12 
16  ″  ″  articles:487ee72bcc30e27e3216248b22dcf44c 
17  ″  ″  articles:4919de48fc054d99cf8023b4c314db23 
18  ″  ″  articles:4976be8c9b3846da415323cbe3b335aa 
19  ″  ″  articles:49e4b551c2a22d0c859f85671ede60b3 
20  ″  ″  articles:4a75d50e20377b34a29c8b0b13062a1d 
21  ″  ″  articles:4c1224b869e442fd3670caee42935799 
22  ″  ″  articles:5ba91cc7c4ae24d4eab63404abe6756a 
23  ″  ″  articles:5c6c658d7ba2b3bb9e8a459199265a31 
24  ″  ″  articles:737ba490ae0f7652b09298a146b5afb4 
25  ″  ″  articles:74172120a44cc603e723f061055ee02b 
26  ″  ″  articles:7eadf74fcaae26e085dfa9f7b529e4bf 
27  ″  ″  articles:85b04a1b13fed9166b1bfbcc3898b73d 
28  ″  ″  articles:9988529a590dc0932ddffaa05ea65614 
29  ″  ″  articles:a2f15fa07a15b59c13321c295c11d824 
30  ″  ″  articles:a321109a45ff0d9ea53e2efdcf619786 
31  ″  ″  articles:ac084c171dd0a2d1b7b6385a09d692ba 
32  ″  ″  articles:b0f5474f0e061e40d53083646e033ecc 
33  ″  ″  articles:b4bdb0afc411543cafd3acc95bbfc3e6 
34  ″  ″  articles:b7f008c98c913cb318cd4a1fa85e6153 
35  ″  ″  articles:bb0730c0274f3e402022d76477d4ef1e 
36  ″  ″  articles:c0cf86527661c40fcfde393b1bf754c5 
37  ″  ″  articles:c51acf44fde6c65e51286b31e0dea319 
38  ″  ″  articles:c609836fc46ea51261f9d303e5f966da 
39  ″  ″  articles:c84c72e3b5d9bbc35b92cd57ca3dea72 
40  ″  ″  articles:e1208cf098c6c29b44598ce20564bc03 
41  ″  ″  articles:e5872abc3db76f7784c1a71af5c72d20 
42  ″  ″  articles:e79d10af771fbcbacbe408e1ba5d6aa5 
43  ″  ″  articles:ee0474055c54e1ed60661b8a20bd2239 
44  ″  ″  articles:ee65b59710fc31383917fa4fe9c04e9d 
45  ″  ″  articles:f5af54fa772b9b6c9b1564e5b3df9f52 
46  ″  ″  articles:ffa2293602951d0221da6e2b256fbe48 
47  ″  sg:hasFundingOrganization  gridinstitutes:grid.410422.1 
48  ″  sg:hasRecipientOrganization  gridinstitutes:grid.410422.1 
49  ″  sg:language  English 
50  ″  sg:license  http://scigraph.springernature.com/explorer/license/ 
51  ″  sg:scigraphId  1e931ce8c52270899ce00d0a3ce6f8c7 
52  ″  sg:title  Statistical Learning for Biomedical Data 
53  ″  sg:webpage  http://projectreporter.nih.gov/project_info_description.cfm?aid=9146127 
54  ″  rdf:type  sg:Grant 
55  ″  rdfs:label  Grant: Statistical Learning for Biomedical Data 
JSON-LD is a popular JSON format for linked data.
curl -H 'Accept: application/ld+json' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'
N-Triples is a line-based linked data format ideal for batch operations.
curl -H 'Accept: application/n-triples' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'
Turtle is a human-readable linked data format.
curl -H 'Accept: text/turtle' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'
RDF/XML is a standard XML format for linked data.
curl -H 'Accept: application/rdf+xml' 'http://scigraph.springernature.com/things/grants/1e931ce8c52270899ce00d0a3ce6f8c7'
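The "batch operations" claim for N-Triples follows from its one-triple-per-line layout: filtering and counting reduce to plain line scans. A small offline sketch (the URIs below are illustrative stand-ins shaped like this grant's triples, not the endpoint's actual namespace):

```python
# N-Triples is one triple per line, so batch filtering is a line scan.
# The sample triples use made-up example.org URIs for illustration only.
sample_nt = "\n".join([
    '<http://example.org/grant> <http://example.org/fundingAmount> "795593.0" .',
    '<http://example.org/grant> <http://example.org/hasFundedPublication> <http://example.org/article1> .',
    '<http://example.org/grant> <http://example.org/hasFundedPublication> <http://example.org/article2> .',
])

# Count funded publications with a plain substring filter per line:
pubs = [line for line in sample_nt.splitlines()
        if "hasFundedPublication" in line]
print(len(pubs))  # prints 2
```

The same scan works on the real endpoint's response piped through standard line-oriented tools, which is exactly why the format suits batch processing.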