Fold classification based on secondary structure – how much is gained by including loop topology?


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2006-03-08

AUTHORS

Jieun Jeong, Piotr Berman, Teresa Przytycka

ABSTRACT

Background: It has been proposed that secondary structure information can be used to classify protein folds, at least to some extent. Since this method uses very limited information about the protein structure, it is not surprising that it has a higher error rate than approaches that use a full 3D fold description. On the other hand, comparing 3D protein structures is computationally intensive. This raises the question of to what extent the error rate can be decreased with each new source of information, especially if the new information can still be used with simple alignment algorithms. We consider the question of whether information about closed loops can improve the accuracy of this approach. While the answer appears to be obvious, we had to overcome two challenges: first, how to encode and compare topological information in such a way that local alignment of strings will properly identify similar structures; second, how to properly measure the effect of new information in a large data sample. We investigate alternative ways of computing and presenting this information.

Results: We used the set of beta proteins with at most 30% pairwise identity to test the approach. Local alignment scores were used to build a tree of clusters, which was evaluated using a new log-odds cluster scoring function. In particular, we derive a closed formula for the probability of obtaining a given score by chance. Parameters of the local alignment function were optimized using a genetic algorithm. Of the 81 folds that had more than one representative in our data set, log-odds scores registered significantly better clustering in 27 cases, significantly worse clustering in 6 cases, and small differences in the remaining cases. Various notions of significant change and average change were considered and tried, and the results all pointed in the same direction.

Conclusion: We found that, on average, properly presented information about the loop topology noticeably improves the accuracy of the method, but the benefits, as measured by the log-odds cluster score, vary between fold families.
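The abstract relies on local alignment of strings that encode secondary structure. As a rough illustration only, the sketch below computes a Smith-Waterman local alignment score between two such strings. The alphabet (`E` for strand, `H` for helix), the linear gap penalty, and the `match`/`mismatch`/`gap` values are all assumptions made for this example; the abstract does not give the authors' encoding of loop topology or the parameter values found by their genetic algorithm.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are placeholders, not the optimized parameters
    described in the abstract.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical secondary-structure strings: E = strand, H = helix.
    print(local_alignment_score("EEEHEE", "HEEEHEEH"))
```

Pairwise scores of this kind could then feed a hierarchical clustering step, which is where a cluster scoring function such as the one described in the abstract would be applied.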

PAGES

3

References to SciGraph publications

  • 2001-11. Identification of homology in protein structure classification in NATURE STRUCTURAL & MOLECULAR BIOLOGY
  • 1999-07. A protein taxonomy based on secondary structure in NATURE STRUCTURAL & MOLECULAR BIOLOGY

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1186/1472-6807-6-3

    DOI

    http://dx.doi.org/10.1186/1472-6807-6-3

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1033549508

    PUBMED

    https://www.ncbi.nlm.nih.gov/pubmed/16524467



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/06", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Biological Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0601", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Biochemistry and Cell Biology", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Algorithms", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Amino Acid Sequence", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Cluster Analysis", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Protein Folding", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Protein Structure, Secondary", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Proteins", 
            "type": "DefinedTerm"
          }, 
          {
            "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
            "name": "Sequence Alignment", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA", 
              "id": "http://www.grid.ac/institutes/grid.29857.31", 
              "name": [
                "Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Jeong", 
            "givenName": "Jieun", 
            "id": "sg:person.0645500667.93", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0645500667.93"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA", 
              "id": "http://www.grid.ac/institutes/grid.29857.31", 
              "name": [
                "Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Berman", 
            "givenName": "Piotr", 
            "id": "sg:person.01274506210.27", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01274506210.27"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, Bethesda, USA", 
              "id": "http://www.grid.ac/institutes/grid.419234.9", 
              "name": [
                "National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, Bethesda, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Przytycka", 
            "givenName": "Teresa", 
            "id": "sg:person.01325035263.95", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01325035263.95"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1038/10728", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1031624707", 
              "https://doi.org/10.1038/10728"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1038/nsb1101-953", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037147119", 
              "https://doi.org/10.1038/nsb1101-953"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2006-03-08", 
        "datePublishedReg": "2006-03-08", 
        "description": "BackgroundIt has been proposed that secondary structure information can be used to classify (to some extend) protein folds. Since this method utilizes very limited information about the protein structure, it is not surprising that it has a higher error rate than the approaches that use full 3D fold description. On the other hand, the comparing of 3D protein structures is computing intensive. This raises the question to what extend the error rate can be decreased with each new source of information, especially if the new information can still be used with simple alignment algorithms.We consider the question whether the information about closed loops can improve the accuracy of this approach. While the answer appears to be obvious, we had to overcome two challenges. First, how to code and to compare topological information in such a way that local alignment of strings will properly identify similar structures. Second, how to properly measure the effect of new information in a large data sample.We investigate alternative ways of computing and presenting this information.ResultsWe used the set of beta proteins with at most 30% pairwise identity to test the approach; local alignment scores were used to build a tree of clusters which was evaluated using a new log-odd cluster scoring function. In particular, we derive a closed formula for the probability of obtaining a given score by chance.Parameters of local alignment function were optimized using a genetic algorithm.Of 81 folds that had more than one representative in our data set, log-odds scores registered significantly better clustering in 27 cases and significantly worse in 6 cases, and small differences in the remaining cases. 
Various notions of the significant change or average change were considered and tried, and the results were all pointing in the same direction.ConclusionWe found that, on average, properly presented information about the loop topology improves noticeably the accuracy of the method but the benefits vary between fold families as measured by log-odds cluster score.", 
        "genre": "article", 
        "id": "sg:pub.10.1186/1472-6807-6-3", 
        "isAccessibleForFree": true, 
        "isFundedItemOf": [
          {
            "id": "sg:grant.2529090", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.2720312", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.3027409", 
            "type": "MonetaryGrant"
          }
        ], 
        "isPartOf": [
          {
            "id": "sg:journal.1024246", 
            "issn": [
              "2314-4343", 
              "2661-8850"
            ], 
            "name": "BMC Molecular and Cell Biology", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "1", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "6"
          }
        ], 
        "keywords": [
          "simple alignment algorithms", 
          "tree of clusters", 
          "error rate", 
          "local alignment scores", 
          "alignment algorithm", 
          "topological information", 
          "high error rate", 
          "genetic algorithm", 
          "local alignment", 
          "secondary structure information", 
          "alignment scores", 
          "structure information", 
          "alignment function", 
          "data sets", 
          "data samples", 
          "scoring functions", 
          "algorithm", 
          "large data samples", 
          "information", 
          "loop topology", 
          "accuracy", 
          "topology", 
          "fold description", 
          "set", 
          "new information", 
          "classification", 
          "log-odds scores", 
          "way", 
          "alternative way", 
          "method", 
          "closed loop", 
          "challenges", 
          "trees", 
          "new sources", 
          "string", 
          "protein structure", 
          "alignment", 
          "answers", 
          "clusters", 
          "protein folds", 
          "comparing", 
          "description", 
          "notion", 
          "probability", 
          "benefits", 
          "closed formula", 
          "structure", 
          "hand", 
          "function", 
          "limited information", 
          "similar structure", 
          "questions", 
          "results", 
          "cases", 
          "direction", 
          "loop", 
          "parameters", 
          "chance", 
          "source", 
          "pairwise identity", 
          "scores", 
          "fold family", 
          "rate", 
          "identity", 
          "cluster scores", 
          "formula", 
          "representatives", 
          "secondary structure", 
          "same direction", 
          "changes", 
          "ConclusionWe", 
          "folds", 
          "ResultsWe", 
          "small differences", 
          "family", 
          "samples", 
          "differences", 
          "significant changes", 
          "effect", 
          "average change", 
          "approach", 
          "beta protein", 
          "protein", 
          "BackgroundIt"
        ], 
        "name": "Fold classification based on secondary structure \u2013 how much is gained by including loop topology?", 
        "pagination": "3", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1033549508"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1186/1472-6807-6-3"
            ]
          }, 
          {
            "name": "pubmed_id", 
            "type": "PropertyValue", 
            "value": [
              "16524467"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1186/1472-6807-6-3", 
          "https://app.dimensions.ai/details/publication/pub.1033549508"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-08-04T16:55", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20220804/entities/gbq_results/article/article_422.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1186/1472-6807-6-3"
      }
    ]
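A record like the one above can be read with any JSON parser. The following is a minimal sketch, assuming the record is a JSON array with one article object shaped as shown; the `RECORD` string is a trimmed copy of a few fields from this page, and the `summarize` helper is a hypothetical name introduced for illustration.

```python
import json

# Trimmed copy of a few fields from the record above (not the full record).
RECORD = """
[
  {
    "id": "sg:pub.10.1186/1472-6807-6-3",
    "datePublished": "2006-03-08",
    "author": [
      {"familyName": "Jeong", "givenName": "Jieun"},
      {"familyName": "Berman", "givenName": "Piotr"},
      {"familyName": "Przytycka", "givenName": "Teresa"}
    ],
    "productId": [
      {"name": "doi", "value": ["10.1186/1472-6807-6-3"]},
      {"name": "pubmed_id", "value": ["16524467"]}
    ]
  }
]
"""


def summarize(record_text):
    """Extract id, date, author names and identifiers from a record string."""
    article = json.loads(record_text)[0]  # assumes a one-element array
    authors = [
        f"{person['givenName']} {person['familyName']}"
        for person in article.get("author", [])
    ]
    identifiers = {
        prop["name"]: prop["value"][0] for prop in article.get("productId", [])
    }
    return {
        "id": article["id"],
        "date": article.get("datePublished"),
        "authors": authors,
        "identifiers": identifiers,
    }


if __name__ == "__main__":
    print(summarize(RECORD))
```

This treats the record as plain JSON and ignores the `@context`; a full JSON-LD processor would be needed to expand the terms into URIs.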
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1472-6807-6-3'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1472-6807-6-3'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1472-6807-6-3'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1472-6807-6-3'
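The four curl commands differ only in the `Accept` header, which is ordinary HTTP content negotiation. The sketch below builds equivalent requests with Python's standard library. The `FORMATS` keys and the `build_request` helper are names chosen for this example, and whether the endpoint still answers is not something this page confirms, so the network call is left commented out.

```python
from urllib.request import Request

# URL taken from the curl commands above.
BASE = "https://scigraph.springernature.com/pub.10.1186/1472-6807-6-3"

# Media types taken from the curl commands; the short keys are our own labels.
FORMATS = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE):
    """Return a GET request asking for the record in the given RDF format."""
    return Request(url, headers={"Accept": FORMATS[fmt]})


if __name__ == "__main__":
    for fmt in FORMATS:
        req = build_request(fmt)
        print(fmt, "->", req.get_header("Accept"))
    # To actually fetch (assumes the service is reachable):
    # from urllib.request import urlopen
    # with urlopen(build_request("turtle")) as resp:
    #     print(resp.read().decode("utf-8"))
```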


     

    This table displays all metadata directly associated with this object as RDF triples.

    204 TRIPLES      21 PREDICATES      118 URIs      108 LITERALS      14 BLANK NODES
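Summary counts of this kind can be approximated from an N-Triples export. The sketch below is an illustration under stated assumptions: it counts distinct predicates, distinct URIs and blank nodes in subject or object position, and literal occurrences, which may not match how the page derives its own figures. The sample triples reuse values from this record, but the blank node label `_:b0` is hypothetical, and the regular expression handles only simple one-line triples.

```python
import re

# Matches "subject predicate object ." for simple N-Triples lines.
TRIPLE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")

SAMPLE = """
<http://scigraph.springernature.com/pub.10.1186/1472-6807-6-3> <http://schema.org/datePublished> "2006-03-08" .
<http://scigraph.springernature.com/pub.10.1186/1472-6807-6-3> <http://schema.org/pagination> "3" .
<http://scigraph.springernature.com/pub.10.1186/1472-6807-6-3> <http://schema.org/productId> _:b0 .
_:b0 <http://schema.org/name> "doi" .
_:b0 <http://schema.org/value> "10.1186/1472-6807-6-3" .
"""


def summarize_triples(text):
    """Count triples, predicates, URIs, literals and blank nodes."""
    triples, literals = 0, 0
    predicates, uris, blanks = set(), set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        found = TRIPLE.match(line)
        if found is None:
            raise ValueError(f"not a simple N-Triples line: {line!r}")
        subject, predicate, obj = found.groups()
        triples += 1
        predicates.add(predicate)
        for term in (subject, obj):
            if term.startswith("<"):
                uris.add(term)
            elif term.startswith("_:"):
                blanks.add(term)
            else:
                literals += 1  # literal occurrences, not distinct values
    return {
        "triples": triples,
        "predicates": len(predicates),
        "uris": len(uris),
        "literals": literals,
        "blank_nodes": len(blanks),
    }


if __name__ == "__main__":
    print(summarize_triples(SAMPLE))
```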

    Subject Predicate Object
    1 sg:pub.10.1186/1472-6807-6-3 schema:about N2c37c75843c74d67a26d3202d6378b13
    2 N6ab015de7bce4345ac04c4a7b6da4c10
    3 N6e4e4451e0444309b76b56fa3358743e
    4 N7575b7c38495420ba4654bc74a85d148
    5 N96d55d15a5644afea0a1abf48e54e3de
    6 Na8b6869e00194ff5b9a18e3408a4047a
    7 Ne32de8d7b11e4fdcbb2b28bd15d43530
    8 anzsrc-for:06
    9 anzsrc-for:0601
    10 schema:author N619eb0cd42314625afdae0142ec0b3d4
    11 schema:citation sg:pub.10.1038/10728
    12 sg:pub.10.1038/nsb1101-953
    13 schema:datePublished 2006-03-08
    14 schema:datePublishedReg 2006-03-08
    15 schema:description BackgroundIt has been proposed that secondary structure information can be used to classify (to some extend) protein folds. Since this method utilizes very limited information about the protein structure, it is not surprising that it has a higher error rate than the approaches that use full 3D fold description. On the other hand, the comparing of 3D protein structures is computing intensive. This raises the question to what extend the error rate can be decreased with each new source of information, especially if the new information can still be used with simple alignment algorithms.We consider the question whether the information about closed loops can improve the accuracy of this approach. While the answer appears to be obvious, we had to overcome two challenges. First, how to code and to compare topological information in such a way that local alignment of strings will properly identify similar structures. Second, how to properly measure the effect of new information in a large data sample.We investigate alternative ways of computing and presenting this information.ResultsWe used the set of beta proteins with at most 30% pairwise identity to test the approach; local alignment scores were used to build a tree of clusters which was evaluated using a new log-odd cluster scoring function. In particular, we derive a closed formula for the probability of obtaining a given score by chance.Parameters of local alignment function were optimized using a genetic algorithm.Of 81 folds that had more than one representative in our data set, log-odds scores registered significantly better clustering in 27 cases and significantly worse in 6 cases, and small differences in the remaining cases. 
Various notions of the significant change or average change were considered and tried, and the results were all pointing in the same direction.ConclusionWe found that, on average, properly presented information about the loop topology improves noticeably the accuracy of the method but the benefits vary between fold families as measured by log-odds cluster score.
    16 schema:genre article
    17 schema:isAccessibleForFree true
    18 schema:isPartOf Ncd2f65f435ea42fbaec5038b282c0fbe
    19 Nd0df4152a0b941298e4e87d0267bb17d
    20 sg:journal.1024246
    21 schema:keywords BackgroundIt
    22 ConclusionWe
    23 ResultsWe
    24 accuracy
    25 algorithm
    26 alignment
    27 alignment algorithm
    28 alignment function
    29 alignment scores
    30 alternative way
    31 answers
    32 approach
    33 average change
    34 benefits
    35 beta protein
    36 cases
    37 challenges
    38 chance
    39 changes
    40 classification
    41 closed formula
    42 closed loop
    43 cluster scores
    44 clusters
    45 comparing
    46 data samples
    47 data sets
    48 description
    49 differences
    50 direction
    51 effect
    52 error rate
    53 family
    54 fold description
    55 fold family
    56 folds
    57 formula
    58 function
    59 genetic algorithm
    60 hand
    61 high error rate
    62 identity
    63 information
    64 large data samples
    65 limited information
    66 local alignment
    67 local alignment scores
    68 log-odds scores
    69 loop
    70 loop topology
    71 method
    72 new information
    73 new sources
    74 notion
    75 pairwise identity
    76 parameters
    77 probability
    78 protein
    79 protein folds
    80 protein structure
    81 questions
    82 rate
    83 representatives
    84 results
    85 same direction
    86 samples
    87 scores
    88 scoring functions
    89 secondary structure
    90 secondary structure information
    91 set
    92 significant changes
    93 similar structure
    94 simple alignment algorithms
    95 small differences
    96 source
    97 string
    98 structure
    99 structure information
    100 topological information
    101 topology
    102 tree of clusters
    103 trees
    104 way
    105 schema:name Fold classification based on secondary structure – how much is gained by including loop topology?
    106 schema:pagination 3
    107 schema:productId N4310f689b7f045bfb63169334a50bbe7
    108 N6b9d826396374e1dba4ae79cb291b3da
    109 N77c3a8d9ddfa4d129565f8b94418c737
    110 schema:sameAs https://app.dimensions.ai/details/publication/pub.1033549508
    111 https://doi.org/10.1186/1472-6807-6-3
    112 schema:sdDatePublished 2022-08-04T16:55
    113 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    114 schema:sdPublisher Na2deabe760a7406aa80f5a349e8f3b23
    115 schema:url https://doi.org/10.1186/1472-6807-6-3
    116 sgo:license sg:explorer/license/
    117 sgo:sdDataset articles
    118 rdf:type schema:ScholarlyArticle
    119 N02e18447d01049b7a076d197dbd00646 rdf:first sg:person.01325035263.95
    120 rdf:rest rdf:nil
    121 N2c37c75843c74d67a26d3202d6378b13 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    122 schema:name Protein Structure, Secondary
    123 rdf:type schema:DefinedTerm
    124 N4310f689b7f045bfb63169334a50bbe7 schema:name pubmed_id
    125 schema:value 16524467
    126 rdf:type schema:PropertyValue
    127 N619eb0cd42314625afdae0142ec0b3d4 rdf:first sg:person.0645500667.93
    128 rdf:rest Naf4e210807774da7aa3edffdce16a581
    129 N6ab015de7bce4345ac04c4a7b6da4c10 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    130 schema:name Sequence Alignment
    131 rdf:type schema:DefinedTerm
    132 N6b9d826396374e1dba4ae79cb291b3da schema:name doi
    133 schema:value 10.1186/1472-6807-6-3
    134 rdf:type schema:PropertyValue
    135 N6e4e4451e0444309b76b56fa3358743e schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    136 schema:name Protein Folding
    137 rdf:type schema:DefinedTerm
    138 N7575b7c38495420ba4654bc74a85d148 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    139 schema:name Cluster Analysis
    140 rdf:type schema:DefinedTerm
    141 N77c3a8d9ddfa4d129565f8b94418c737 schema:name dimensions_id
    142 schema:value pub.1033549508
    143 rdf:type schema:PropertyValue
    144 N96d55d15a5644afea0a1abf48e54e3de schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    145 schema:name Algorithms
    146 rdf:type schema:DefinedTerm
    147 Na2deabe760a7406aa80f5a349e8f3b23 schema:name Springer Nature - SN SciGraph project
    148 rdf:type schema:Organization
    149 Na8b6869e00194ff5b9a18e3408a4047a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    150 schema:name Proteins
    151 rdf:type schema:DefinedTerm
    152 Naf4e210807774da7aa3edffdce16a581 rdf:first sg:person.01274506210.27
    153 rdf:rest N02e18447d01049b7a076d197dbd00646
    154 Ncd2f65f435ea42fbaec5038b282c0fbe schema:volumeNumber 6
    155 rdf:type schema:PublicationVolume
    156 Nd0df4152a0b941298e4e87d0267bb17d schema:issueNumber 1
    157 rdf:type schema:PublicationIssue
    158 Ne32de8d7b11e4fdcbb2b28bd15d43530 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
    159 schema:name Amino Acid Sequence
    160 rdf:type schema:DefinedTerm
    161 anzsrc-for:06 schema:inDefinedTermSet anzsrc-for:
    162 schema:name Biological Sciences
    163 rdf:type schema:DefinedTerm
    164 anzsrc-for:0601 schema:inDefinedTermSet anzsrc-for:
    165 schema:name Biochemistry and Cell Biology
    166 rdf:type schema:DefinedTerm
    167 sg:grant.2529090 http://pending.schema.org/fundedItem sg:pub.10.1186/1472-6807-6-3
    168 rdf:type schema:MonetaryGrant
    169 sg:grant.2720312 http://pending.schema.org/fundedItem sg:pub.10.1186/1472-6807-6-3
    170 rdf:type schema:MonetaryGrant
    171 sg:grant.3027409 http://pending.schema.org/fundedItem sg:pub.10.1186/1472-6807-6-3
    172 rdf:type schema:MonetaryGrant
    173 sg:journal.1024246 schema:issn 2314-4343
    174 2661-8850
    175 schema:name BMC Molecular and Cell Biology
    176 schema:publisher Springer Nature
    177 rdf:type schema:Periodical
    178 sg:person.01274506210.27 schema:affiliation grid-institutes:grid.29857.31
    179 schema:familyName Berman
    180 schema:givenName Piotr
    181 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01274506210.27
    182 rdf:type schema:Person
    183 sg:person.01325035263.95 schema:affiliation grid-institutes:grid.419234.9
    184 schema:familyName Przytycka
    185 schema:givenName Teresa
    186 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01325035263.95
    187 rdf:type schema:Person
    188 sg:person.0645500667.93 schema:affiliation grid-institutes:grid.29857.31
    189 schema:familyName Jeong
    190 schema:givenName Jieun
    191 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0645500667.93
    192 rdf:type schema:Person
    193 sg:pub.10.1038/10728 schema:sameAs https://app.dimensions.ai/details/publication/pub.1031624707
    194 https://doi.org/10.1038/10728
    195 rdf:type schema:CreativeWork
    196 sg:pub.10.1038/nsb1101-953 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037147119
    197 https://doi.org/10.1038/nsb1101-953
    198 rdf:type schema:CreativeWork
    199 grid-institutes:grid.29857.31 schema:alternateName Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
    200 schema:name Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
    201 rdf:type schema:Organization
    202 grid-institutes:grid.419234.9 schema:alternateName National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, Bethesda, USA
    203 schema:name National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, Bethesda, USA
    204 rdf:type schema:Organization
     



