Enhanced protein domain discovery using taxonomy


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2004-12

AUTHORS

Lachlan Coin, Alex Bateman, Richard Durbin

ABSTRACT

BACKGROUND: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids. RESULTS: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. CONCLUSIONS: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.

PAGES

56

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/1471-2105-5-56

DOI

http://dx.doi.org/10.1186/1471-2105-5-56

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1052785352

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/15137915



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0601", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biochemistry and Cell Biology", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/06", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biological Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Computational Biology", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Protein", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Models, Statistical", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Protein Structure, Tertiary", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Proteins", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Species Specificity", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Coin", 
        "givenName": "Lachlan", 
        "id": "sg:person.0713704267.53", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0713704267.53"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "name": [
            "Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Bateman", 
        "givenName": "Alex", 
        "id": "sg:person.01253551753.58", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01253551753.58"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "name": [
            "Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Durbin", 
        "givenName": "Richard", 
        "id": "sg:person.012246531224.10", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012246531224.10"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1073/pnas.0737502100", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000030464"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/s0014-5793(00)01891-3", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008793882"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1006/jmbi.1998.2221", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1010624753"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1172/jci118918", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015191906"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1006/jmbi.1994.1104", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1016537913"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1073/pnas.92.11.4957", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019060094"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/gkh121", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1020798638"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/bioinformatics/14.9.755", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1024610917"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/25.1.236", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1028119331"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/emboj/17.5.1192", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1028987051"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/30.1.260", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1036197701"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/nar/25.17.3389", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1047265454"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1074/jbc.m000787200", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050526794"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2004-12", 
    "datePublishedReg": "2004-12-01", 
    "description": "BACKGROUND: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids.\nRESULTS: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques.\nCONCLUSIONS: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1186/1471-2105-5-56", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1023786", 
        "issn": [
          "1471-2105"
        ], 
        "name": "BMC Bioinformatics", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "5"
      }
    ], 
    "name": "Enhanced protein domain discovery using taxonomy", 
    "pagination": "56", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "fb6ee0ae50538401a6ca9141e478c2d385d55d4370f04847ed7809dda60aa8d6"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "15137915"
        ]
      }, 
      {
        "name": "nlm_unique_id", 
        "type": "PropertyValue", 
        "value": [
          "100965194"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/1471-2105-5-56"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1052785352"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/1471-2105-5-56", 
      "https://app.dimensions.ai/details/publication/pub.1052785352"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-11T10:31", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000349_0000000349/records_113650_00000002.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://link.springer.com/10.1186%2F1471-2105-5-56"
  }
]
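
As a rough illustration of how the record above can be read, the following Python sketch parses a trimmed copy of the JSON-LD and pulls out the title, the authors, and the identifiers. The constant `RECORD_JSON` and the function `summarize` are hypothetical names introduced here; the field names (`name`, `author`, `productId`, `value`) are the ones that appear in the record. It treats the document as ordinary JSON and does not perform JSON-LD context expansion.

```python
import json

# A trimmed copy of the record above, keeping only the fields this sketch reads.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Coin", "givenName": "Lachlan", "type": "Person"},
      {"familyName": "Bateman", "givenName": "Alex", "type": "Person"},
      {"familyName": "Durbin", "givenName": "Richard", "type": "Person"}
    ],
    "datePublished": "2004-12",
    "name": "Enhanced protein domain discovery using taxonomy",
    "productId": [
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["15137915"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/1471-2105-5-56"]}
    ]
  }
]
"""


def summarize(record_json):
    """Return title, date, author names, and identifiers from a SciGraph-style record."""
    record = json.loads(record_json)[0]  # the document is a one-element array
    authors = [f"{a['givenName']} {a['familyName']}" for a in record["author"]]
    # productId is a list of name/value pairs; each value is itself a list.
    ids = {p["name"]: p["value"][0] for p in record["productId"]}
    return {
        "title": record["name"],
        "date": record["datePublished"],
        "authors": authors,
        "ids": ids,
    }


summary = summarize(RECORD_JSON)
print(summary["title"])
print(", ".join(summary["authors"]))
print(summary["ids"]["doi"])
```

Because each `productId` value is a one-element list, the sketch takes index 0; a record with several values per identifier would need a different choice.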
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56'
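
The four curl commands above differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The short format keys (`json-ld`, `nt`, `turtle`, `xml`) and the helper names `build_request` and `fetch` are illustrative choices, not part of any SciGraph API. Whether the endpoint still answers is not something this page confirms, so the example only builds the request and leaves the network call to `fetch`.

```python
import urllib.request

# URL and media types are taken from the curl commands above.
BASE_URL = "https://scigraph.springernature.com/pub.10.1186/1471-2105-5-56"

ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE_URL):
    """Build a GET request that asks the server for one RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt, url=BASE_URL, timeout=30):
    """Perform the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Inspect the request without contacting the server.
request = build_request("turtle")
print(request.full_url)
print(request.get_header("Accept"))
```

Calling `fetch("nt")` would be the counterpart of the N-Triples curl command, assuming the service is reachable.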


 

This table displays all metadata directly associated with this object as RDF triples.

148 TRIPLES      21 PREDICATES      48 URIs      27 LITERALS      15 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/1471-2105-5-56 schema:about N0db5ee1bd4284647a729ba4a85aabbde
2 N20e59ff296de491a899aafde2238c742
3 N38747017539a460ca1bbb52a46ea0074
4 N59d5a4cfbdd74f0788fc3c6e4119ccf2
5 Na76d4527e40549eabe46b3d1e4a76c1b
6 Ne7f25903415a4770b12c7351c62a1b56
7 anzsrc-for:06
8 anzsrc-for:0601
9 schema:author Nbbdba853e2db4052bbcc71b3347de30f
10 schema:citation https://doi.org/10.1006/jmbi.1994.1104
11 https://doi.org/10.1006/jmbi.1998.2221
12 https://doi.org/10.1016/s0014-5793(00)01891-3
13 https://doi.org/10.1073/pnas.0737502100
14 https://doi.org/10.1073/pnas.92.11.4957
15 https://doi.org/10.1074/jbc.m000787200
16 https://doi.org/10.1093/bioinformatics/14.9.755
17 https://doi.org/10.1093/emboj/17.5.1192
18 https://doi.org/10.1093/nar/25.1.236
19 https://doi.org/10.1093/nar/25.17.3389
20 https://doi.org/10.1093/nar/30.1.260
21 https://doi.org/10.1093/nar/gkh121
22 https://doi.org/10.1172/jci118918
23 schema:datePublished 2004-12
24 schema:datePublishedReg 2004-12-01
25 schema:description BACKGROUND: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids. RESULTS: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. CONCLUSIONS: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.
26 schema:genre research_article
27 schema:inLanguage en
28 schema:isAccessibleForFree true
29 schema:isPartOf N2543fbc654d74faebfa2d849cc949fc1
30 Nb04a10c51bb14e06b534675572fb242b
31 sg:journal.1023786
32 schema:name Enhanced protein domain discovery using taxonomy
33 schema:pagination 56
34 schema:productId N12304c5e8441460dbcd249c0ae65fa8b
35 N22f83ee3a3be4cb683b4358b8ae08331
36 N670a8b450cbe4d9c9b9bf0fee7997a3f
37 N71c0dc652b824b7594a0260e74b76394
38 N90cc683ea2f744b3b0ce18fbfb778cfc
39 schema:sameAs https://app.dimensions.ai/details/publication/pub.1052785352
40 https://doi.org/10.1186/1471-2105-5-56
41 schema:sdDatePublished 2019-04-11T10:31
42 schema:sdLicense https://scigraph.springernature.com/explorer/license/
43 schema:sdPublisher Ne38f6b324b9f4dcb902089dc8bf9e45b
44 schema:url https://link.springer.com/10.1186%2F1471-2105-5-56
45 sgo:license sg:explorer/license/
46 sgo:sdDataset articles
47 rdf:type schema:ScholarlyArticle
48 N07838426a9b441adb8c435478398fedb rdf:first sg:person.012246531224.10
49 rdf:rest rdf:nil
50 N0db5ee1bd4284647a729ba4a85aabbde schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
51 schema:name Databases, Protein
52 rdf:type schema:DefinedTerm
53 N12304c5e8441460dbcd249c0ae65fa8b schema:name doi
54 schema:value 10.1186/1471-2105-5-56
55 rdf:type schema:PropertyValue
56 N20e59ff296de491a899aafde2238c742 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
57 schema:name Models, Statistical
58 rdf:type schema:DefinedTerm
59 N22f83ee3a3be4cb683b4358b8ae08331 schema:name readcube_id
60 schema:value fb6ee0ae50538401a6ca9141e478c2d385d55d4370f04847ed7809dda60aa8d6
61 rdf:type schema:PropertyValue
62 N2543fbc654d74faebfa2d849cc949fc1 schema:issueNumber 1
63 rdf:type schema:PublicationIssue
64 N38747017539a460ca1bbb52a46ea0074 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
65 schema:name Computational Biology
66 rdf:type schema:DefinedTerm
67 N59d5a4cfbdd74f0788fc3c6e4119ccf2 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
68 schema:name Protein Structure, Tertiary
69 rdf:type schema:DefinedTerm
70 N670a8b450cbe4d9c9b9bf0fee7997a3f schema:name nlm_unique_id
71 schema:value 100965194
72 rdf:type schema:PropertyValue
73 N68d527c4bc4b4648a2b1e8d8a6d10dba rdf:first sg:person.01253551753.58
74 rdf:rest N07838426a9b441adb8c435478398fedb
75 N71c0dc652b824b7594a0260e74b76394 schema:name dimensions_id
76 schema:value pub.1052785352
77 rdf:type schema:PropertyValue
78 N77e2a46da3224611ac3e0b05e35ff84d schema:name Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK
79 rdf:type schema:Organization
80 N7cdf5db7891242d985d268b992eb87f6 schema:name Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK
81 rdf:type schema:Organization
82 N90cc683ea2f744b3b0ce18fbfb778cfc schema:name pubmed_id
83 schema:value 15137915
84 rdf:type schema:PropertyValue
85 Na76d4527e40549eabe46b3d1e4a76c1b schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
86 schema:name Species Specificity
87 rdf:type schema:DefinedTerm
88 Nb04a10c51bb14e06b534675572fb242b schema:volumeNumber 5
89 rdf:type schema:PublicationVolume
90 Nb77064f927c44dddb9a27a4644503a16 schema:name Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, CambridgeUK
91 rdf:type schema:Organization
92 Nbbdba853e2db4052bbcc71b3347de30f rdf:first sg:person.0713704267.53
93 rdf:rest N68d527c4bc4b4648a2b1e8d8a6d10dba
94 Ne38f6b324b9f4dcb902089dc8bf9e45b schema:name Springer Nature - SN SciGraph project
95 rdf:type schema:Organization
96 Ne7f25903415a4770b12c7351c62a1b56 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
97 schema:name Proteins
98 rdf:type schema:DefinedTerm
99 anzsrc-for:06 schema:inDefinedTermSet anzsrc-for:
100 schema:name Biological Sciences
101 rdf:type schema:DefinedTerm
102 anzsrc-for:0601 schema:inDefinedTermSet anzsrc-for:
103 schema:name Biochemistry and Cell Biology
104 rdf:type schema:DefinedTerm
105 sg:journal.1023786 schema:issn 1471-2105
106 schema:name BMC Bioinformatics
107 rdf:type schema:Periodical
108 sg:person.012246531224.10 schema:affiliation Nb77064f927c44dddb9a27a4644503a16
109 schema:familyName Durbin
110 schema:givenName Richard
111 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012246531224.10
112 rdf:type schema:Person
113 sg:person.01253551753.58 schema:affiliation N77e2a46da3224611ac3e0b05e35ff84d
114 schema:familyName Bateman
115 schema:givenName Alex
116 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01253551753.58
117 rdf:type schema:Person
118 sg:person.0713704267.53 schema:affiliation N7cdf5db7891242d985d268b992eb87f6
119 schema:familyName Coin
120 schema:givenName Lachlan
121 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0713704267.53
122 rdf:type schema:Person
123 https://doi.org/10.1006/jmbi.1994.1104 schema:sameAs https://app.dimensions.ai/details/publication/pub.1016537913
124 rdf:type schema:CreativeWork
125 https://doi.org/10.1006/jmbi.1998.2221 schema:sameAs https://app.dimensions.ai/details/publication/pub.1010624753
126 rdf:type schema:CreativeWork
127 https://doi.org/10.1016/s0014-5793(00)01891-3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008793882
128 rdf:type schema:CreativeWork
129 https://doi.org/10.1073/pnas.0737502100 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000030464
130 rdf:type schema:CreativeWork
131 https://doi.org/10.1073/pnas.92.11.4957 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019060094
132 rdf:type schema:CreativeWork
133 https://doi.org/10.1074/jbc.m000787200 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050526794
134 rdf:type schema:CreativeWork
135 https://doi.org/10.1093/bioinformatics/14.9.755 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024610917
136 rdf:type schema:CreativeWork
137 https://doi.org/10.1093/emboj/17.5.1192 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028987051
138 rdf:type schema:CreativeWork
139 https://doi.org/10.1093/nar/25.1.236 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028119331
140 rdf:type schema:CreativeWork
141 https://doi.org/10.1093/nar/25.17.3389 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047265454
142 rdf:type schema:CreativeWork
143 https://doi.org/10.1093/nar/30.1.260 schema:sameAs https://app.dimensions.ai/details/publication/pub.1036197701
144 rdf:type schema:CreativeWork
145 https://doi.org/10.1093/nar/gkh121 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020798638
146 rdf:type schema:CreativeWork
147 https://doi.org/10.1172/jci118918 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015191906
148 rdf:type schema:CreativeWork
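
In the table above, the author order is not stored as a plain array: `schema:author` points to a blank node, and the authors are chained through `rdf:first` and `rdf:rest` until `rdf:nil`. The Python sketch below copies the relevant rows (9, 48-49, 73-74, 92-93, and the three `schema:familyName` rows) as plain tuples and walks that chain. The names `TRIPLES` and `walk_list` are hypothetical, and a real application would more likely use an RDF library than hand-written tuples.

```python
# Selected rows from the table, written as (subject, predicate, object) tuples.
TRIPLES = [
    ("sg:pub.10.1186/1471-2105-5-56", "schema:author", "Nbbdba853e2db4052bbcc71b3347de30f"),
    ("Nbbdba853e2db4052bbcc71b3347de30f", "rdf:first", "sg:person.0713704267.53"),
    ("Nbbdba853e2db4052bbcc71b3347de30f", "rdf:rest", "N68d527c4bc4b4648a2b1e8d8a6d10dba"),
    ("N68d527c4bc4b4648a2b1e8d8a6d10dba", "rdf:first", "sg:person.01253551753.58"),
    ("N68d527c4bc4b4648a2b1e8d8a6d10dba", "rdf:rest", "N07838426a9b441adb8c435478398fedb"),
    ("N07838426a9b441adb8c435478398fedb", "rdf:first", "sg:person.012246531224.10"),
    ("N07838426a9b441adb8c435478398fedb", "rdf:rest", "rdf:nil"),
    ("sg:person.0713704267.53", "schema:familyName", "Coin"),
    ("sg:person.01253551753.58", "schema:familyName", "Bateman"),
    ("sg:person.012246531224.10", "schema:familyName", "Durbin"),
]


def walk_list(triples, head):
    """Follow rdf:first / rdf:rest links from head until rdf:nil, returning the items in order."""
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}
    items = []
    node = head
    while node != "rdf:nil":
        items.append(first[node])
        node = rest[node]
    return items


# The object of schema:author is the head of the list.
head = next(o for s, p, o in TRIPLES if p == "schema:author")
family_name = {s: o for s, p, o in TRIPLES if p == "schema:familyName"}
authors = [family_name[person] for person in walk_list(TRIPLES, head)]
predicates = {p for _, p, _ in TRIPLES}

print(authors)
print(len(TRIPLES), "triples,", len(predicates), "distinct predicates in this excerpt")
```

The recovered order (Coin, Bateman, Durbin) matches the `author` array in the JSON-LD record; the counts printed at the end describe only this ten-row excerpt, not the 148 triples of the full table.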
 



