YEARS

2008-2010

AUTHORS

Mark Girolami

TITLE

Advancing Machine Learning Methodology for New Classes of Prediction Problems

ABSTRACT

The last few decades have seen enormous progress in the development of machine learning and pattern recognition algorithms for data classification. This has resulted in considerable advances in a number of applied fields, with some of these algorithms forming the core of ubiquitous deployed technologies. However, there exist very many important applications, for example in biomedicine, which are highly non-standard prediction problems, and there is an urgent need to develop appropriate and effective classification techniques for such applications.

For example, at NIPS 2006 Girolami & Zhong reported state-of-the-art prediction accuracy for a protein fold classification problem which stands at a modest 62%. While this may partly be due to overlaps between classes of fold, it is also clear that some of the fundamental assumptions made by most classification algorithms are not valid in this application. In particular, most algorithms make assumptions about the structure of the data that are not met in reality: that the data (both training and test) are independent and identically distributed (i.i.d.) draws from the same distribution, that labels are unbiased (i.e. the relative proportions of positive and negative examples are approximately balanced), and that noise on both the input data and the labels can be largely ignored.

Recent advances in Machine Learning, such as kernel-based methods and the availability of efficient computational methods for Bayesian inference, hold great promise that classification problems in non-standard situations can be addressed in a principled way. The development of effective classification tools is all the more urgent given the daunting pace at which technological advances are producing novel data sets. This is particularly true in the life sciences, where advances in molecular biology and proteomics are leading to the production of vast amounts of data, necessitating the development of methods for high-throughput automated analysis. Improving classification accuracy may remove what is currently the bottleneck in the analysis of this type of data, with real impact on biomedical research and on the quality of life of millions of people.

At present most classifiers used in life sciences applications, especially those deployed as bioinformatics web services, adopt and adapt traditional Machine Learning approaches, such as Artificial Neural Networks and Support Vector Machines, quite often in an ad hoc manner. However, many of these applications are highly non-standard classification problems in the sense that a number of the fundamental underlying assumptions of pattern classification and decision theory (e.g. identical sampling distributions for 'training' and 'test' data, perfect noiseless labeling in the discrete case, object representations which can be embedded in a common feature space) are violated, and this has a direct and potentially highly negative impact on achievable performance. To make much-needed and significant progress on a wide range of important applications, there is an urgent requirement to systematically address the associated methodological issues within a common framework, and this is what motivates the current proposal.

FUNDED PUBLICATIONS

  • Addressing the Challenge of Defining Valid Proteomic Biomarkers and Classifiers


    22 TRIPLES      17 PREDICATES      23 URIs      10 LITERALS

    Subject (all rows): grants:9343b32ddd82235f06615666828e9d21

    #   Predicate                     Object
    1   sg:abstract                   (the abstract text given under ABSTRACT above)
    2   sg:endYear                    2010
    3   sg:fundingAmount              210198.0
    4   sg:fundingCurrency            GBP
    5   sg:hasContribution            contributions:ecd08cf4b9768ae4a0c89938138b7803
    6   sg:hasFieldOfResearchCode     anzsrc-for:01
    7   sg:hasFieldOfResearchCode     anzsrc-for:0104
    8   sg:hasFieldOfResearchCode     anzsrc-for:08
    9   sg:hasFieldOfResearchCode     anzsrc-for:0801
    10  sg:hasFundedPublication       articles:cbea610e12cd3769e180b7541fd1c43a
    11  sg:hasFundedPublication       articles:d95e4933117c104c29c8ed0bf564b1d6
    12  sg:hasFundingOrganization     grid-institutes:grid.421091.f
    13  sg:hasRecipientOrganization   grid-institutes:grid.8756.c
    14  sg:language                   English
    15  sg:license                    http://scigraph.springernature.com/explorer/license/
    16  sg:license                    Contains UK public sector information licensed under the Open Government Licence v2.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/).
    17  sg:scigraphId                 9343b32ddd82235f06615666828e9d21
    18  sg:startYear                  2008
    19  sg:title                      Advancing Machine Learning Methodology for New Classes of Prediction Problems
    20  sg:webpage                    http://gtr.rcuk.ac.uk/project/EBDB4323-4907-4E2A-BFCB-E0AA252F3A39
    21  rdf:type                      sg:Grant
    22  rdfs:label                    Grant: Advancing Machine Learning Methodology for New Classes of Prediction Problems

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular JSON format for linked data.

    curl -H 'Accept: application/ld+json' 'http://scigraph.springernature.com/things/grants/9343b32ddd82235f06615666828e9d21'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'http://scigraph.springernature.com/things/grants/9343b32ddd82235f06615666828e9d21'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'http://scigraph.springernature.com/things/grants/9343b32ddd82235f06615666828e9d21'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'http://scigraph.springernature.com/things/grants/9343b32ddd82235f06615666828e9d21'
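
The four commands above differ only in the Accept header, so they can be wrapped in a small script. The following is a minimal sketch, not an official client: the URL, grant ID and media types are copied from the commands above, while the function names (accept_for, fetch_grant), the short format names and the output file names (grant.<format>) are assumptions made for illustration. Whether the endpoint still responds is not something the sketch can guarantee.

```shell
#!/bin/sh
# Sketch only: helper names and output file names are illustrative assumptions.
# BASE_URL and GRANT_ID are taken from the curl commands in this record.

BASE_URL='http://scigraph.springernature.com/things/grants'
GRANT_ID='9343b32ddd82235f06615666828e9d21'

# Map a short format name to the Accept header used in the commands above.
accept_for() {
    case "$1" in
        json-ld) echo 'application/ld+json' ;;
        nt)      echo 'application/n-triples' ;;
        turtle)  echo 'text/turtle' ;;
        xml)     echo 'application/rdf+xml' ;;
        *)       echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Download one serialization to grant.<format>.
# -f: fail on HTTP errors, -sS: no progress bar but keep error messages,
# -L: follow redirects.
fetch_grant() {
    fmt="$1"
    accept="$(accept_for "$fmt")" || return 1
    curl -fsSL -H "Accept: $accept" -o "grant.$fmt" "$BASE_URL/$GRANT_ID"
}
```

Sourcing the script defines the helpers without contacting the server. To fetch all four serializations, one could then run: for fmt in json-ld nt turtle xml; do fetch_grant "$fmt"; done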





