A comparison of machine learning and Bayesian modelling for molecular serotyping


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2017-12

AUTHORS

Richard Newton, Lorenz Wernisch

ABSTRACT

BACKGROUND: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only a few samples available, a model-driven approach was the only option. In the meantime, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model.

RESULTS: We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes, owing to the large number of possible combinations. Most of the available training data comprise samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single-serotype arrays.

CONCLUSIONS: With the enhanced training set, the machine learning algorithms outperform the original Bayesian model. However, for serotypes currently lacking sufficient training data, the best-performing implementation was a combination of the results of the Bayesian model and the Gradient Boosting Machine. As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.
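
The idea of building artificial mixture samples from single-serotype arrays can be illustrated with a small Python sketch. The abstract does not say how the raw data were combined, so the element-wise maximum used here is purely an illustrative assumption, as are the function names, the toy labels "A", "B", "C", and the three-probe intensity profiles. It is not the authors' procedure.

```python
from itertools import combinations


def make_mixture(profile_a, profile_b):
    """Combine two single-serotype probe-intensity profiles.

    ASSUMPTION: the element-wise maximum stands in for whatever
    combination rule the paper actually uses.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same probes")
    return [max(a, b) for a, b in zip(profile_a, profile_b)]


def synthesize_mixtures(singles):
    """Build one artificial two-serotype sample per pair of labels.

    singles maps a (hypothetical) serotype label to its intensity profile.
    Returns a list of ((label_a, label_b), mixed_profile) tuples.
    """
    mixtures = []
    for (label_a, prof_a), (label_b, prof_b) in combinations(sorted(singles.items()), 2):
        mixtures.append(((label_a, label_b), make_mixture(prof_a, prof_b)))
    return mixtures


# Toy data: three made-up single-serotype arrays with three probes each.
singles = {
    "A": [0.9, 0.1, 0.2],
    "B": [0.1, 0.8, 0.3],
    "C": [0.2, 0.2, 0.7],
}
training = synthesize_mixtures(singles)
```

Each synthetic sample keeps both source labels, so a multi-label classifier could be trained on it alongside the real single-serotype arrays.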

PAGES

606

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/s12864-017-3998-6

DOI

http://dx.doi.org/10.1186/s12864-017-3998-6

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1091167089

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/28800724



JSON-LD is the canonical representation for SciGraph data.

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Bayes Theorem", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Machine Learning", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Models, Statistical", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Oligonucleotide Array Sequence Analysis", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Serotyping", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Streptococcus pneumoniae", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "MRC Biostatistics Unit", 
          "id": "https://www.grid.ac/institutes/grid.415038.b", 
          "name": [
            "MRC Biostatistics Unit, Robinson Way, CB2 0SR, Cambridge, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Newton", 
        "givenName": "Richard", 
        "id": "sg:person.016650740112.04", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016650740112.04"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "MRC Biostatistics Unit", 
          "id": "https://www.grid.ac/institutes/grid.415038.b", 
          "name": [
            "MRC Biostatistics Unit, Robinson Way, CB2 0SR, Cambridge, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Wernisch", 
        "givenName": "Lorenz", 
        "id": "sg:person.01132465512.22", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01132465512.22"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1186/1471-2105-12-88", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017262754", 
          "https://doi.org/10.1186/1471-2105-12-88"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1214/aos/1013203451", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1030645893"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1371/journal.pmed.1001903", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050170479"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2017-12", 
    "datePublishedReg": "2017-12-01", 
    "description": "BACKGROUND: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only few samples available, a model driven approach was the only option. In the meanwhile, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model.\nRESULTS: We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes; due to the large number of possible combinations. Most of the available training data comprises samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single serotype arrays.\nCONCLUSIONS: With the enhanced training set the machine learning algorithms out perform the original Bayesian model. However, for serotypes currently lacking sufficient training data the best performing implementation was a combination of the results of the Bayesian Model and the Gradient Boosting Machine. 
As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1186/s12864-017-3998-6", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isFundedItemOf": [
      {
        "id": "sg:grant.2771293", 
        "type": "MonetaryGrant"
      }, 
      {
        "id": "sg:grant.7611026", 
        "type": "MonetaryGrant"
      }
    ], 
    "isPartOf": [
      {
        "id": "sg:journal.1023790", 
        "issn": [
          "1471-2164"
        ], 
        "name": "BMC Genomics", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "18"
      }
    ], 
    "name": "A comparison of machine learning and Bayesian modelling for molecular serotyping", 
    "pagination": "606", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "26713ccf2386b6af4546bfdab01c456f9d394ea1107acc89892f9f2b73a90c11"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "28800724"
        ]
      }, 
      {
        "name": "nlm_unique_id", 
        "type": "PropertyValue", 
        "value": [
          "100965258"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/s12864-017-3998-6"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1091167089"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/s12864-017-3998-6", 
      "https://app.dimensions.ai/details/publication/pub.1091167089"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-11T09:52", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000347_0000000347/records_89790_00000003.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://link.springer.com/10.1186%2Fs12864-017-3998-6"
  }
]
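
As a rough illustration of how the record above might be consumed, the following Python sketch reads a trimmed copy of the JSON-LD and pulls out the title, author names, and identifiers. The excerpt keeps only a few keys from the full record, and the helper name `summarize` is hypothetical.

```python
import json

# Trimmed excerpt of the record above; only the keys used below are kept.
RECORD = """
[
  {
    "name": "A comparison of machine learning and Bayesian modelling for molecular serotyping",
    "author": [
      {"familyName": "Newton", "givenName": "Richard"},
      {"familyName": "Wernisch", "givenName": "Lorenz"}
    ],
    "pagination": "606",
    "productId": [
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["28800724"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/s12864-017-3998-6"]}
    ]
  }
]
"""


def summarize(record_json):
    """Return title, author names, DOI and PubMed id from a SciGraph-style record."""
    record = json.loads(record_json)[0]  # the document is a one-element array
    # Each productId entry holds a name and a single-element value list.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    authors = [f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])]
    return {
        "title": record["name"],
        "authors": authors,
        "doi": ids.get("doi"),
        "pubmed_id": ids.get("pubmed_id"),
    }


summary = summarize(RECORD)
```

Flattening `productId` into a dictionary keyed by `name` avoids depending on the order in which the identifiers appear.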
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s12864-017-3998-6'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s12864-017-3998-6'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s12864-017-3998-6'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s12864-017-3998-6'
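
The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The function names, the short format keys, and the timeout value are hypothetical, and whether the endpoint still responds is not something this sketch can guarantee.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl examples above; the short keys are assumptions.
MIME_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id, fmt):
    """Prepare (but do not send) a GET request asking for the given RDF format."""
    return urllib.request.Request(
        BASE_URL + record_id,
        headers={"Accept": MIME_TYPES[fmt]},
    )


def fetch(record_id, fmt, timeout=10.0):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request needs no network; calling fetch(...) would perform the download.
request = build_request("pub.10.1186/s12864-017-3998-6", "turtle")
```

Keeping request construction separate from the network call makes the header logic easy to check offline.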


 

This table displays all metadata directly associated with this object as RDF triples.

113 TRIPLES      21 PREDICATES      38 URIs      27 LITERALS      15 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/s12864-017-3998-6 schema:about N399bfae7402a4d4a95e67074a428ac7a
2 N4a850d32538143259f6daad4656f5831
3 N5c12492391b84b3488a05c140a691d25
4 N9760c0c0f7534699a25a5f094d4e4e3c
5 Nafe3f59663ac4ca89796128b6c6a6b69
6 Ne174698a7c78422f9673fd63b439d907
7 anzsrc-for:08
8 anzsrc-for:0801
9 schema:author N6f226747ad894398b4bcf5b5176dc2c2
10 schema:citation sg:pub.10.1186/1471-2105-12-88
11 https://doi.org/10.1214/aos/1013203451
12 https://doi.org/10.1371/journal.pmed.1001903
13 schema:datePublished 2017-12
14 schema:datePublishedReg 2017-12-01
15 schema:description BACKGROUND: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only few samples available, a model driven approach was the only option. In the meanwhile, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model. RESULTS: We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes; due to the large number of possible combinations. Most of the available training data comprises samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single serotype arrays. CONCLUSIONS: With the enhanced training set the machine learning algorithms out perform the original Bayesian model. However, for serotypes currently lacking sufficient training data the best performing implementation was a combination of the results of the Bayesian Model and the Gradient Boosting Machine. 
As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.
16 schema:genre research_article
17 schema:inLanguage en
18 schema:isAccessibleForFree true
19 schema:isPartOf N1a68f2bdb6e047458273368022fab841
20 N1f5148559ab94e4daafa51814f51fcba
21 sg:journal.1023790
22 schema:name A comparison of machine learning and Bayesian modelling for molecular serotyping
23 schema:pagination 606
24 schema:productId N010b1fded16c4aeab0cdb421a1fd14a3
25 N25cf023060714e8fb7e31f250d59fb15
26 Na2ce48745d154877a51e828a5dd057b4
27 Ncc42cf2ed6df42f5920605459ad6ab24
28 Ne99f76a807744e0ea04eaeeae2a70e59
29 schema:sameAs https://app.dimensions.ai/details/publication/pub.1091167089
30 https://doi.org/10.1186/s12864-017-3998-6
31 schema:sdDatePublished 2019-04-11T09:52
32 schema:sdLicense https://scigraph.springernature.com/explorer/license/
33 schema:sdPublisher Nd6aaf79a2b8f42f6a27766ac66bb076a
34 schema:url https://link.springer.com/10.1186%2Fs12864-017-3998-6
35 sgo:license sg:explorer/license/
36 sgo:sdDataset articles
37 rdf:type schema:ScholarlyArticle
38 N010b1fded16c4aeab0cdb421a1fd14a3 schema:name dimensions_id
39 schema:value pub.1091167089
40 rdf:type schema:PropertyValue
41 N1a68f2bdb6e047458273368022fab841 schema:issueNumber 1
42 rdf:type schema:PublicationIssue
43 N1f5148559ab94e4daafa51814f51fcba schema:volumeNumber 18
44 rdf:type schema:PublicationVolume
45 N25cf023060714e8fb7e31f250d59fb15 schema:name doi
46 schema:value 10.1186/s12864-017-3998-6
47 rdf:type schema:PropertyValue
48 N399bfae7402a4d4a95e67074a428ac7a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
49 schema:name Machine Learning
50 rdf:type schema:DefinedTerm
51 N4a850d32538143259f6daad4656f5831 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
52 schema:name Streptococcus pneumoniae
53 rdf:type schema:DefinedTerm
54 N5c12492391b84b3488a05c140a691d25 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
55 schema:name Serotyping
56 rdf:type schema:DefinedTerm
57 N6451b1735e194c05ae79ae02787b9a64 rdf:first sg:person.01132465512.22
58 rdf:rest rdf:nil
59 N6f226747ad894398b4bcf5b5176dc2c2 rdf:first sg:person.016650740112.04
60 rdf:rest N6451b1735e194c05ae79ae02787b9a64
61 N9760c0c0f7534699a25a5f094d4e4e3c schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
62 schema:name Models, Statistical
63 rdf:type schema:DefinedTerm
64 Na2ce48745d154877a51e828a5dd057b4 schema:name pubmed_id
65 schema:value 28800724
66 rdf:type schema:PropertyValue
67 Nafe3f59663ac4ca89796128b6c6a6b69 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
68 schema:name Oligonucleotide Array Sequence Analysis
69 rdf:type schema:DefinedTerm
70 Ncc42cf2ed6df42f5920605459ad6ab24 schema:name readcube_id
71 schema:value 26713ccf2386b6af4546bfdab01c456f9d394ea1107acc89892f9f2b73a90c11
72 rdf:type schema:PropertyValue
73 Nd6aaf79a2b8f42f6a27766ac66bb076a schema:name Springer Nature - SN SciGraph project
74 rdf:type schema:Organization
75 Ne174698a7c78422f9673fd63b439d907 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
76 schema:name Bayes Theorem
77 rdf:type schema:DefinedTerm
78 Ne99f76a807744e0ea04eaeeae2a70e59 schema:name nlm_unique_id
79 schema:value 100965258
80 rdf:type schema:PropertyValue
81 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
82 schema:name Information and Computing Sciences
83 rdf:type schema:DefinedTerm
84 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
85 schema:name Artificial Intelligence and Image Processing
86 rdf:type schema:DefinedTerm
87 sg:grant.2771293 http://pending.schema.org/fundedItem sg:pub.10.1186/s12864-017-3998-6
88 rdf:type schema:MonetaryGrant
89 sg:grant.7611026 http://pending.schema.org/fundedItem sg:pub.10.1186/s12864-017-3998-6
90 rdf:type schema:MonetaryGrant
91 sg:journal.1023790 schema:issn 1471-2164
92 schema:name BMC Genomics
93 rdf:type schema:Periodical
94 sg:person.01132465512.22 schema:affiliation https://www.grid.ac/institutes/grid.415038.b
95 schema:familyName Wernisch
96 schema:givenName Lorenz
97 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01132465512.22
98 rdf:type schema:Person
99 sg:person.016650740112.04 schema:affiliation https://www.grid.ac/institutes/grid.415038.b
100 schema:familyName Newton
101 schema:givenName Richard
102 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016650740112.04
103 rdf:type schema:Person
104 sg:pub.10.1186/1471-2105-12-88 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017262754
105 https://doi.org/10.1186/1471-2105-12-88
106 rdf:type schema:CreativeWork
107 https://doi.org/10.1214/aos/1013203451 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030645893
108 rdf:type schema:CreativeWork
109 https://doi.org/10.1371/journal.pmed.1001903 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050170479
110 rdf:type schema:CreativeWork
111 https://www.grid.ac/institutes/grid.415038.b schema:alternateName MRC Biostatistics Unit
112 schema:name MRC Biostatistics Unit, Robinson Way, CB2 0SR, Cambridge, UK
113 rdf:type schema:Organization
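
In the table, the author order is encoded as an RDF list: `schema:author` points at a blank node, and each node carries an `rdf:first` item and an `rdf:rest` link that ends at `rdf:nil` (rows 9 and 57 to 60). A minimal Python sketch of walking that chain follows; the triples are copied from those rows, while the tuple representation and the function name `walk_rdf_list` are assumptions made for illustration.

```python
RDF_FIRST = "rdf:first"
RDF_REST = "rdf:rest"
RDF_NIL = "rdf:nil"

# (subject, predicate, object) tuples copied from rows 57-60 of the table.
TRIPLES = [
    ("N6451b1735e194c05ae79ae02787b9a64", RDF_FIRST, "sg:person.01132465512.22"),
    ("N6451b1735e194c05ae79ae02787b9a64", RDF_REST, RDF_NIL),
    ("N6f226747ad894398b4bcf5b5176dc2c2", RDF_FIRST, "sg:person.016650740112.04"),
    ("N6f226747ad894398b4bcf5b5176dc2c2", RDF_REST, "N6451b1735e194c05ae79ae02787b9a64"),
]


def walk_rdf_list(head, triples):
    """Follow rdf:first / rdf:rest links from head and return the items in order."""
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError("cycle detected in RDF list")
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items


# Row 9 gives the list head: schema:author N6f226747ad894398b4bcf5b5176dc2c2
authors = walk_rdf_list("N6f226747ad894398b4bcf5b5176dc2c2", TRIPLES)
```

The resulting order matches the author order in the JSON-LD view: Newton (`sg:person.016650740112.04`) first, then Wernisch (`sg:person.01132465512.22`).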
 



