Enhanced protein domain discovery using taxonomy


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2004-12

AUTHORS

Lachlan Coin, Alex Bateman, Richard Durbin

ABSTRACT

BACKGROUND: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids. RESULTS: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. CONCLUSIONS: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.

PAGES

56

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/1471-2105-5-56

DOI

http://dx.doi.org/10.1186/1471-2105-5-56

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1052785352

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/15137915



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0601", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biochemistry and Cell Biology", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/06", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biological Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Computational Biology", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Protein", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Models, Statistical", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Protein Structure, Tertiary", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Proteins", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Species Specificity", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Coin", 
        "givenName": "Lachlan", 
        "id": "sg:person.0713704267.53", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0713704267.53"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "name": [
            "Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Bateman", 
        "givenName": "Alex", 
        "id": "sg:person.01253551753.58", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01253551753.58"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "name": [
            "Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Durbin", 
        "givenName": "Richard", 
        "id": "sg:person.012246531224.10", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012246531224.10"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1073/pnas.0737502100", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000030464"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/s0014-5793(00)01891-3", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008793882"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1006/jmbi.1998.2221", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1010624753"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1172/jci118918", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015191906"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1006/jmbi.1994.1104", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1016537913"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1073/pnas.92.11.4957", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019060094"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/gkh121", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1020798638"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/bioinformatics/14.9.755", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1024610917"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/25.1.236", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1028119331"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/emboj/17.5.1192", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1028987051"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/30.1.260", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1036197701"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/25.17.3389", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1047265454"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1074/jbc.m000787200", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050526794"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2004-12", 
    "datePublishedReg": "2004-12-01", 
    "description": "BACKGROUND: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids.\nRESULTS: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques.\nCONCLUSIONS: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1186/1471-2105-5-56", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1023786", 
        "issn": [
          "1471-2105"
        ], 
        "name": "BMC Bioinformatics", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "5"
      }
    ], 
    "name": "Enhanced protein domain discovery using taxonomy", 
    "pagination": "56", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "fb6ee0ae50538401a6ca9141e478c2d385d55d4370f04847ed7809dda60aa8d6"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "15137915"
        ]
      }, 
      {
        "name": "nlm_unique_id", 
        "type": "PropertyValue", 
        "value": [
          "100965194"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/1471-2105-5-56"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1052785352"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/1471-2105-5-56", 
      "https://app.dimensions.ai/details/publication/pub.1052785352"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-11T10:31", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000349_0000000349/records_113650_00000002.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://link.springer.com/10.1186%2F1471-2105-5-56"
  }
]
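Because the record is plain JSON, it can be read with any JSON parser. The sketch below is illustrative only: it embeds a trimmed copy of the record above (only the fields it uses) and pulls out the title, the author names, and the identifiers. The function name summarize and the variable RECORD_JSON are hypothetical and not part of SciGraph.

```python
import json

# Trimmed copy of the record above; only the fields used here are kept.
RECORD_JSON = """
[
  {
    "id": "sg:pub.10.1186/1471-2105-5-56",
    "name": "Enhanced protein domain discovery using taxonomy",
    "author": [
      {"familyName": "Coin", "givenName": "Lachlan"},
      {"familyName": "Bateman", "givenName": "Alex"},
      {"familyName": "Durbin", "givenName": "Richard"}
    ],
    "productId": [
      {"name": "pubmed_id", "value": ["15137915"]},
      {"name": "doi", "value": ["10.1186/1471-2105-5-56"]}
    ]
  }
]
"""


def summarize(record_json):
    """Return title, author names, and identifiers from a record of this shape."""
    record = json.loads(record_json)[0]  # the document is a one-element list
    authors = [
        f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])
    ]
    # productId is a list of name/value pairs; each value is itself a list.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    return {"title": record["name"], "authors": authors, "ids": ids}


summary = summarize(RECORD_JSON)
print(summary["authors"])
print(summary["ids"]["doi"])
```

The same function should work on the full record, since it only reads keys that appear in the JSON-LD above.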
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'
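The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. This is a minimal sketch, not an official client: the short format keys, BASE, build_request, and fetch are hypothetical names, and fetch assumes the endpoint is reachable from your machine.

```python
import urllib.request

BASE = "https://scigraph.springernature.com/"

# Media types taken from the curl commands above; the short keys are
# labels chosen for this sketch.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id, fmt="json-ld"):
    """Build (but do not send) a GET request asking for one RDF serialization."""
    return urllib.request.Request(
        BASE + record_id, headers={"Accept": MEDIA_TYPES[fmt]}
    )


def fetch(record_id, fmt="json-ld", timeout=30):
    """Send the request and return the body as text; requires network access."""
    request = build_request(record_id, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Inspect the request without touching the network.
req = build_request("pub.10.1186/1471-2105-5-56", "turtle")
print(req.full_url, req.get_header("Accept"))
```

Separating build_request from fetch keeps the header logic checkable offline; only fetch performs I/O.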


 

This table displays all metadata directly associated with this object as RDF triples.

148 TRIPLES      21 PREDICATES      48 URIs      27 LITERALS      15 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/1471-2105-5-56 schema:about N5ec3630c5ff1490f82f980e74386b515
2 N88fd5792d548407dbf9c895d4b176517
3 N9b475e6eaf654bbe887a233d3dc28526
4 Na672111010634b8184978213400e0e31
5 Nf28fb28b272f4948b705c3bb53c19e8a
6 Nf42b5d67dcc14201ad5872747d7a3b61
7 anzsrc-for:06
8 anzsrc-for:0601
9 schema:author Nc483d17003d146caae46a6f64fb064b1
10 schema:citation https://doi.org/10.1006/jmbi.1994.1104
11 https://doi.org/10.1006/jmbi.1998.2221
12 https://doi.org/10.1016/s0014-5793(00)01891-3
13 https://doi.org/10.1073/pnas.0737502100
14 https://doi.org/10.1073/pnas.92.11.4957
15 https://doi.org/10.1074/jbc.m000787200
16 https://doi.org/10.1093/bioinformatics/14.9.755
17 https://doi.org/10.1093/emboj/17.5.1192
18 https://doi.org/10.1093/nar/25.1.236
19 https://doi.org/10.1093/nar/25.17.3389
20 https://doi.org/10.1093/nar/30.1.260
21 https://doi.org/10.1093/nar/gkh121
22 https://doi.org/10.1172/jci118918
23 schema:datePublished 2004-12
24 schema:datePublishedReg 2004-12-01
25 schema:description BACKGROUND: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids. RESULTS: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. CONCLUSIONS: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.
26 schema:genre research_article
27 schema:inLanguage en
28 schema:isAccessibleForFree true
29 schema:isPartOf N4f1a0a519fa2490da1335e455a2cf4cc
30 Na6d2bd99023a48dbb8af53fb65d76426
31 sg:journal.1023786
32 schema:name Enhanced protein domain discovery using taxonomy
33 schema:pagination 56
34 schema:productId N2782743e4bab435e98d10e4b71592702
35 N64f34ed8c9e34953be17056c7e298f3c
36 Nb28084b3200d439390484d820999f7f3
37 Nc8d13e00975b425baaf85f702bb69964
38 Nfb37effa2da74eb692c4aa3f12fafefb
39 schema:sameAs https://app.dimensions.ai/details/publication/pub.1052785352
40 https://doi.org/10.1186/1471-2105-5-56
41 schema:sdDatePublished 2019-04-11T10:31
42 schema:sdLicense https://scigraph.springernature.com/explorer/license/
43 schema:sdPublisher N712da06e283f414bac83a1e800b9d31d
44 schema:url https://link.springer.com/10.1186%2F1471-2105-5-56
45 sgo:license sg:explorer/license/
46 sgo:sdDataset articles
47 rdf:type schema:ScholarlyArticle
48 N01510bd9b7124ec9b51d06301e1672e4 schema:name Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK
49 rdf:type schema:Organization
50 N1f267cb6815f4eaca563ff7beb38bc2c schema:name Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK
51 rdf:type schema:Organization
52 N2782743e4bab435e98d10e4b71592702 schema:name pubmed_id
53 schema:value 15137915
54 rdf:type schema:PropertyValue
55 N30dfbca7dd174beb992fc1143bb7cee8 rdf:first sg:person.01253551753.58
56 rdf:rest Nb67a7d63260f4eb398f808536d3e3ba9
57 N4f1a0a519fa2490da1335e455a2cf4cc schema:volumeNumber 5
58 rdf:type schema:PublicationVolume
59 N5ec3630c5ff1490f82f980e74386b515 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
60 schema:name Species Specificity
61 rdf:type schema:DefinedTerm
62 N64f34ed8c9e34953be17056c7e298f3c schema:name doi
63 schema:value 10.1186/1471-2105-5-56
64 rdf:type schema:PropertyValue
65 N712da06e283f414bac83a1e800b9d31d schema:name Springer Nature - SN SciGraph project
66 rdf:type schema:Organization
67 N784f244c3c0c4773878a6ebfbd2d07c2 schema:name Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK
68 rdf:type schema:Organization
69 N88fd5792d548407dbf9c895d4b176517 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
70 schema:name Protein Structure, Tertiary
71 rdf:type schema:DefinedTerm
72 N9b475e6eaf654bbe887a233d3dc28526 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
73 schema:name Databases, Protein
74 rdf:type schema:DefinedTerm
75 Na672111010634b8184978213400e0e31 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
76 schema:name Proteins
77 rdf:type schema:DefinedTerm
78 Na6d2bd99023a48dbb8af53fb65d76426 schema:issueNumber 1
79 rdf:type schema:PublicationIssue
80 Nb28084b3200d439390484d820999f7f3 schema:name readcube_id
81 schema:value fb6ee0ae50538401a6ca9141e478c2d385d55d4370f04847ed7809dda60aa8d6
82 rdf:type schema:PropertyValue
83 Nb67a7d63260f4eb398f808536d3e3ba9 rdf:first sg:person.012246531224.10
84 rdf:rest rdf:nil
85 Nc483d17003d146caae46a6f64fb064b1 rdf:first sg:person.0713704267.53
86 rdf:rest N30dfbca7dd174beb992fc1143bb7cee8
87 Nc8d13e00975b425baaf85f702bb69964 schema:name dimensions_id
88 schema:value pub.1052785352
89 rdf:type schema:PropertyValue
90 Nf28fb28b272f4948b705c3bb53c19e8a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
91 schema:name Models, Statistical
92 rdf:type schema:DefinedTerm
93 Nf42b5d67dcc14201ad5872747d7a3b61 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
94 schema:name Computational Biology
95 rdf:type schema:DefinedTerm
96 Nfb37effa2da74eb692c4aa3f12fafefb schema:name nlm_unique_id
97 schema:value 100965194
98 rdf:type schema:PropertyValue
99 anzsrc-for:06 schema:inDefinedTermSet anzsrc-for:
100 schema:name Biological Sciences
101 rdf:type schema:DefinedTerm
102 anzsrc-for:0601 schema:inDefinedTermSet anzsrc-for:
103 schema:name Biochemistry and Cell Biology
104 rdf:type schema:DefinedTerm
105 sg:journal.1023786 schema:issn 1471-2105
106 schema:name BMC Bioinformatics
107 rdf:type schema:Periodical
108 sg:person.012246531224.10 schema:affiliation N01510bd9b7124ec9b51d06301e1672e4
109 schema:familyName Durbin
110 schema:givenName Richard
111 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012246531224.10
112 rdf:type schema:Person
113 sg:person.01253551753.58 schema:affiliation N784f244c3c0c4773878a6ebfbd2d07c2
114 schema:familyName Bateman
115 schema:givenName Alex
116 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01253551753.58
117 rdf:type schema:Person
118 sg:person.0713704267.53 schema:affiliation N1f267cb6815f4eaca563ff7beb38bc2c
119 schema:familyName Coin
120 schema:givenName Lachlan
121 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0713704267.53
122 rdf:type schema:Person
123 https://doi.org/10.1006/jmbi.1994.1104 schema:sameAs https://app.dimensions.ai/details/publication/pub.1016537913
124 rdf:type schema:CreativeWork
125 https://doi.org/10.1006/jmbi.1998.2221 schema:sameAs https://app.dimensions.ai/details/publication/pub.1010624753
126 rdf:type schema:CreativeWork
127 https://doi.org/10.1016/s0014-5793(00)01891-3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008793882
128 rdf:type schema:CreativeWork
129 https://doi.org/10.1073/pnas.0737502100 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000030464
130 rdf:type schema:CreativeWork
131 https://doi.org/10.1073/pnas.92.11.4957 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019060094
132 rdf:type schema:CreativeWork
133 https://doi.org/10.1074/jbc.m000787200 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050526794
134 rdf:type schema:CreativeWork
135 https://doi.org/10.1093/bioinformatics/14.9.755 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024610917
136 rdf:type schema:CreativeWork
137 https://doi.org/10.1093/emboj/17.5.1192 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028987051
138 rdf:type schema:CreativeWork
139 https://doi.org/10.1093/nar/25.1.236 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028119331
140 rdf:type schema:CreativeWork
141 https://doi.org/10.1093/nar/25.17.3389 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047265454
142 rdf:type schema:CreativeWork
143 https://doi.org/10.1093/nar/30.1.260 schema:sameAs https://app.dimensions.ai/details/publication/pub.1036197701
144 rdf:type schema:CreativeWork
145 https://doi.org/10.1093/nar/gkh121 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020798638
146 rdf:type schema:CreativeWork
147 https://doi.org/10.1172/jci118918 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015191906
148 rdf:type schema:CreativeWork
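
In the table above, schema:author points to a blank node rather than to the people themselves: the author order is stored as an RDF list, where each node carries one item (rdf:first) and a pointer to the next node (rdf:rest), ending at rdf:nil. The sketch below walks that chain using the node identifiers and family names copied from the table. The dictionary layout and the function name walk_list are assumptions made for illustration, not a SciGraph API.

```python
RDF_NIL = "rdf:nil"

# (subject, predicate) -> object, copied from the rdf:first / rdf:rest rows above.
TRIPLES = {
    ("Nc483d17003d146caae46a6f64fb064b1", "rdf:first"): "sg:person.0713704267.53",
    ("Nc483d17003d146caae46a6f64fb064b1", "rdf:rest"): "N30dfbca7dd174beb992fc1143bb7cee8",
    ("N30dfbca7dd174beb992fc1143bb7cee8", "rdf:first"): "sg:person.01253551753.58",
    ("N30dfbca7dd174beb992fc1143bb7cee8", "rdf:rest"): "Nb67a7d63260f4eb398f808536d3e3ba9",
    ("Nb67a7d63260f4eb398f808536d3e3ba9", "rdf:first"): "sg:person.012246531224.10",
    ("Nb67a7d63260f4eb398f808536d3e3ba9", "rdf:rest"): RDF_NIL,
}

# schema:familyName values for the three person nodes, also from the table.
FAMILY_NAME = {
    "sg:person.0713704267.53": "Coin",
    "sg:person.01253551753.58": "Bateman",
    "sg:person.012246531224.10": "Durbin",
}


def walk_list(head, triples):
    """Follow rdf:rest links from head and collect each rdf:first item in order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        items.append(triples[(node, "rdf:first")])
        node = triples[(node, "rdf:rest")]
    return items


# The head node is the object of schema:author in row 9 of the table.
authors = walk_list("Nc483d17003d146caae46a6f64fb064b1", TRIPLES)
names = [FAMILY_NAME[a] for a in authors]
print(names)
```

The recovered order matches the AUTHORS line at the top of the record.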
 



