YEARS

-

AUTHORS

Willy Wilbur

TITLE

General and Semi-supervised Machine Learning Applied to Bioinformatics

ABSTRACT

1) Many methods have been investigated for clustering sets of documents in the hope of improving retrieval. Unfortunately, these have generally failed to improve retrieval capability. Part of the problem is that a given document often involves more than one subject, so the documents cannot be cleanly assigned to mutually exclusive categories. To overcome this problem we have developed methods designed to identify a theme among a set of documents. The theme need not encompass the whole of any document; it only needs to exist in some subset of the documents to be identifiable. Some of these same documents may participate in the definition of several themes. One method of finding themes is based on the EM algorithm and uses an iterative procedure that converges to themes. The method has been implemented, tested, and found to be successful.

2) A second approach can be based on the singular value decomposition and is essentially a vector approach.

3) We are also investigating other methods to extract higher-level features. One method we are currently studying is to perform machine learning with an SVM or other classifier and score the documents based on this learning. The Pool Adjacent Violators (PAV) algorithm can then be applied to the resulting scores, and the resulting score function can be discretized without significant loss of information. This allows us to use the results as features that can be individually weighted in another classifier.

4) We have developed a new algorithm, the periodic random orbiter algorithm (PROBE), which can be applied to minimize any convex loss function. We have applied it to the MeSH classification problem, where it seems to work very well and better than the alternatives on such a large problem.

5) Stochastic Gradient Descent (SGD) has gained popularity for solving large-scale supervised machine learning problems. It provides a rapid method for minimizing a number of loss functions and is applicable to Support Vector Machine (SVM) and logistic optimizations. However, SGD does not provide a convenient stopping criterion. Generally, an optimal number of iterations over the data may be determined using held-out data. We have compared stopping predictions based on held-out data with simply stopping at a fixed number of iterations, and found that the latter works as well as the former for a number of commonly studied text classification problems. In particular, fixed stopping works well for MeSH predictions on PubMed records. We also surveyed the published algorithms for SVM learning on large data sets and chose three (PROBE, SVMperf, and Liblinear) for comparison with SGD run for a fixed number of iterations. We find that SGD with a fixed number of iterations performs as well as these alternative methods and is much faster to compute. As an application, we have made SGD-SVM predictions for all MeSH terms and used the PAV algorithm to convert these predictions to probabilities. Such probabilistic predictions lead to ranked MeSH term predictions superior to previously published results on two test sets.

6) We are also investigating methods to create features for machine learning using dependency parses and syntactic parse trees.
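
Items 3) and 5) of the abstract describe training a linear SVM by SGD for a fixed number of passes over the data and then mapping the raw scores to probabilities with PAV. The following is a minimal sketch of that pipeline in plain Python, not the project's implementation. The function names (`sgd_hinge`, `score`, `pav_fit`, `pav_probability`), the regularization constant `lam`, the Pegasos-style step size, and the toy data are all assumptions made for illustration.

```python
import bisect
import random


def sgd_hinge(X, y, epochs=5, lam=0.01, seed=0):
    """Linear SVM (hinge loss, L2 penalty) trained by SGD for a fixed number of epochs.

    Labels in y must be +1 or -1. The step size 1 / (lam * t) is the
    Pegasos schedule, used here only as an illustrative assumption.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):  # fixed stopping: no held-out data is consulted
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1.0 - eta * lam) for wj in w]  # step on the L2 term
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def score(w, x):
    """Raw (uncalibrated) SVM score of one example."""
    return sum(wj * xj for wj, xj in zip(w, x))


def pav_fit(scores, labels):
    """Pool Adjacent Violators: non-decreasing step map from score to P(label = 1).

    Labels must be 0 or 1. Returns a list of (upper score bound, probability).
    """
    groups = {}  # pool tied scores first: score -> (sum of labels, count)
    for s, lab in zip(scores, labels):
        tot, cnt = groups.get(s, (0.0, 0))
        groups[s] = (tot + lab, cnt + 1)
    blocks = []  # each block: [sum of labels, count, largest score covered]
    for s in sorted(groups):
        tot, cnt = groups[s]
        blocks.append([tot, cnt, s])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            tot, cnt, hi = blocks.pop()
            blocks[-1][0] += tot
            blocks[-1][1] += cnt
            blocks[-1][2] = hi
    return [(hi, tot / cnt) for tot, cnt, hi in blocks]


def pav_probability(model, s):
    """Look up the fitted probability for a new score s."""
    bounds = [hi for hi, _ in model]
    k = min(bisect.bisect_left(bounds, s), len(model) - 1)
    return model[k][1]


# Toy calibration example: scores with 0/1 relevance labels.
model = pav_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])
print([round(p, 3) for _, p in model])  # [0.0, 0.333, 1.0, 1.0]
```

Because the PAV output is a step function, the index of the step that a score falls into is one natural way to discretize the score, in the sense of item 3), for use as a feature in a second classifier. How scores that fall between two steps are assigned (here, to the next step up) is a choice made for this sketch.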

FUNDED PUBLICATIONS

  • The Ineffectiveness of Within-Document Term Frequency in Text Classification.
  • Finding related sentence pairs in MEDLINE
  • Identifying named entities from PubMed for enriching semantic categories.
  • Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE.
  • A Study of the Morpho-Semantic Relationship in Medline.
  • Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach.
  • Identifying named entities from PubMed® for enriching semantic categories
  • Assisting manual literature curation for protein-protein interactions using BioQRator.
  • The ineffectiveness of within-document term frequency in text classification
  • Thematic clustering of text documents using an EM-based approach.
  • Machine learning with naturally labeled data for identifying abbreviation definitions.
  • Finding related sentence pairs in MEDLINE.
  • Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users.
  • An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.
  • Improving a gold standard: treating human relevance judgments of MEDLINE document pairs.



    30 TRIPLES      15 PREDICATES      30 URIs      7 LITERALS

    Subject Predicate Object
    1 grants:fc6d7752a3ed013977faa121ea0d3f00 sg:abstract (text as given under ABSTRACT above)
    2 sg:fundingAmount 3216631.0
    3 sg:fundingCurrency USD
    4 sg:hasContribution contributions:1103918851717e5a4e840b334f41891f
    5 sg:hasFieldOfResearchCode anzsrc-for:08
    6 anzsrc-for:0801
    7 sg:hasFundedPublication articles:15e9734ec9716b918b91e109a4e823c9
    8 articles:2bf3020bfd19fec0fdae0539951f54f3
    9 articles:32d468eef58ca2ee8662badf39059200
    10 articles:3e4d23ddab0f8c1581a9dd0c636e60c7
    11 articles:7be08c15f73af55829334d59c128c4ed
    12 articles:96dc0dcb038f89ae3305638b6ffa2f8d
    13 articles:ac3af79955c5fb01e343e14f17540a1a
    14 articles:b3c5309f0539e74bf1ed2d39df15ed3f
    15 articles:c0eb48fc7b77931318b7cb1bf7afabbb
    16 articles:caf79fce1577aab63105c1d4066dde00
    17 articles:d30c0783e252e7cc76a2b3094e3d907f
    18 articles:ef97950796cbebb462c7d749e151c8d9
    19 articles:f11168d6ea77b12ddb935f0c73aecf2e
    20 articles:f618641f6af5fca7296b8187e02a32f8
    21 articles:faf7ee9626594ac42521763c518695e0
    22 sg:hasFundingOrganization grid-institutes:grid.280285.5
    23 sg:hasRecipientOrganization grid-institutes:grid.280285.5
    24 sg:language English
    25 sg:license http://scigraph.springernature.com/explorer/license/
    26 sg:scigraphId fc6d7752a3ed013977faa121ea0d3f00
    27 sg:title General and Semi-supervised Machine Learning Applied to Bioinformatics
    28 sg:webpage http://projectreporter.nih.gov/project_info_description.cfm?aid=9160914
    29 rdf:type sg:Grant
    30 rdfs:label Grant: General and Semi-supervised Machine Learning Applied to Bioinformatics
    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular JSON format for linked data.

    curl -H 'Accept: application/ld+json' 'http://scigraph.springernature.com/things/grants/fc6d7752a3ed013977faa121ea0d3f00'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'http://scigraph.springernature.com/things/grants/fc6d7752a3ed013977faa121ea0d3f00'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'http://scigraph.springernature.com/things/grants/fc6d7752a3ed013977faa121ea0d3f00'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'http://scigraph.springernature.com/things/grants/fc6d7752a3ed013977faa121ea0d3f00'





