Sequence embedding for fast construction of guide trees for multiple sequence alignment


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2010-05-14

AUTHORS

Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins

ABSTRACT

Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N^2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.

Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.

Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
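The general idea in the abstract, embedding each sequence as a vector of distances to a small set of reference sequences instead of filling a full pairwise matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy k-mer distance, the way references are chosen, the value of `t`, and every function name below are assumptions made only for illustration.

```python
import math


def kmer_set(seq, k=2):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Toy distance (assumption): 1 - Jaccard similarity of the k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(sequences, references, distance=kmer_distance):
    """Map each sequence to its vector of distances to the references."""
    return [[distance(s, r) for r in references] for s in sequences]


def euclidean(u, v):
    """Euclidean distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def distance_evaluations(n, t):
    """Count distance calls: (embedding with t references, full matrix)."""
    return n * t, n * (n - 1) // 2


# Hypothetical input; the first two sequences serve as references.
sequences = ["ACGTACGT", "ACGTACGA", "TTGGCCAA", "TTGGCCAT", "ACGTTCGT"]
references = sequences[:2]
vectors = embed(sequences, references)

# Similar sequences should land close together in the embedded space.
near = euclidean(vectors[0], vectors[1])
far = euclidean(vectors[0], vectors[2])
print(near < far)
```

The embedded vectors can then be clustered with any standard method; the number of expensive sequence comparisons grows as N times t rather than N^2, which `distance_evaluations` makes concrete for a given N and an assumed t.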

PAGES

21

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/1748-7188-5-21

DOI

http://dx.doi.org/10.1186/1748-7188-5-21

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1011273666

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/20470396



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland", 
          "id": "http://www.grid.ac/institutes/grid.7886.1", 
          "name": [
            "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Blackshields", 
        "givenName": "Gordon", 
        "id": "sg:person.01175753635.10", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01175753635.10"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland", 
          "id": "http://www.grid.ac/institutes/grid.7886.1", 
          "name": [
            "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Sievers", 
        "givenName": "Fabian", 
        "id": "sg:person.01256311572.34", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01256311572.34"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland", 
          "id": "http://www.grid.ac/institutes/grid.7886.1", 
          "name": [
            "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Shi", 
        "givenName": "Weifeng", 
        "id": "sg:person.0711212027.68", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0711212027.68"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland", 
          "id": "http://www.grid.ac/institutes/grid.7886.1", 
          "name": [
            "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Wilm", 
        "givenName": "Andreas", 
        "id": "sg:person.0716312730.25", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0716312730.25"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland", 
          "id": "http://www.grid.ac/institutes/grid.7886.1", 
          "name": [
            "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Higgins", 
        "givenName": "Desmond G", 
        "id": "sg:person.01065366335.84", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01065366335.84"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf02603120", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1022962956", 
          "https://doi.org/10.1007/bf02603120"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02289565", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1044215102", 
          "https://doi.org/10.1007/bf02289565"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02257378", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1014365591", 
          "https://doi.org/10.1007/bf02257378"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf01200757", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015898290", 
          "https://doi.org/10.1007/bf01200757"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2010-05-14", 
    "datePublishedReg": "2010-05-14", 
    "description": "BackgroundThe most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.ResultsIn this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.ConclusionsWe show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.", 
    "genre": "article", 
    "id": "sg:pub.10.1186/1748-7188-5-21", 
    "inLanguage": "en", 
    "isAccessibleForFree": true, 
    "isFundedItemOf": [
      {
        "id": "sg:grant.3982818", 
        "type": "MonetaryGrant"
      }
    ], 
    "isPartOf": [
      {
        "id": "sg:journal.1036449", 
        "issn": [
          "1748-7188"
        ], 
        "name": "Algorithms for Molecular Biology", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "5"
      }
    ], 
    "keywords": [
      "full distance matrix", 
      "guide tree", 
      "large multiple alignments", 
      "sequence alignment methods", 
      "multiple sequence alignment methods", 
      "source code", 
      "multiple alignment", 
      "pair-wise distances", 
      "memory requirements", 
      "set of sequences", 
      "pairs of sequences", 
      "computation time", 
      "complex objects", 
      "embedding method", 
      "distance calculation", 
      "multiple sequence alignment", 
      "alignment method", 
      "large number", 
      "fast construction", 
      "distance matrix", 
      "sequence alignment", 
      "download", 
      "clustering", 
      "trees", 
      "alignment", 
      "objects", 
      "code", 
      "requirements", 
      "method", 
      "set", 
      "memory", 
      "sequence", 
      "initial step", 
      "space", 
      "number", 
      "time", 
      "significant barriers", 
      "quality", 
      "step", 
      "similarity", 
      "construction", 
      "class", 
      "calculations", 
      "matrix", 
      "most sequences", 
      "distance", 
      "pairs", 
      "approach", 
      "variation", 
      "barriers", 
      "N2", 
      "ConclusionsWe", 
      "ResultsIn", 
      "paper"
    ], 
    "name": "Sequence embedding for fast construction of guide trees for multiple sequence alignment", 
    "pagination": "21", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1011273666"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/1748-7188-5-21"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "20470396"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/1748-7188-5-21", 
      "https://app.dimensions.ai/details/publication/pub.1011273666"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-05-20T07:26", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20220519/entities/gbq_results/article/article_520.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/1748-7188-5-21"
  }
]
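Because the record is plain JSON, it can be read with any JSON parser. The sketch below uses a trimmed excerpt of the record above (only the fields it needs) and pulls out the title, the author names, and the identifiers; the `summarize` helper and the shape of its return value are assumptions, not part of SciGraph.

```python
import json

# Trimmed excerpt of the JSON-LD record shown above.
record_json = """
[
  {
    "id": "sg:pub.10.1186/1748-7188-5-21",
    "name": "Sequence embedding for fast construction of guide trees for multiple sequence alignment",
    "datePublished": "2010-05-14",
    "author": [
      {"familyName": "Blackshields", "givenName": "Gordon"},
      {"familyName": "Sievers", "givenName": "Fabian"},
      {"familyName": "Shi", "givenName": "Weifeng"},
      {"familyName": "Wilm", "givenName": "Andreas"},
      {"familyName": "Higgins", "givenName": "Desmond G"}
    ],
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1011273666"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/1748-7188-5-21"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["20470396"]}
    ]
  }
]
"""


def summarize(record):
    """Collect title, date, author names, and identifiers from one record."""
    authors = [
        f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])
    ]
    # Each productId entry holds a name and a one-element list of values.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    return {
        "title": record["name"],
        "date": record["datePublished"],
        "authors": authors,
        "ids": ids,
    }


# The top-level JSON value is a list containing a single record.
summary = summarize(json.loads(record_json)[0])
print(summary["ids"]["doi"])
print(", ".join(summary["authors"]))
```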
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'
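The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The short format keys and the `build_request` helper are hypothetical names; the block only constructs the requests and does not send them, since whether the endpoint still responds is not verified here.

```python
from urllib.request import Request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl commands above; the keys are hypothetical.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id, fmt):
    """Build (but do not send) a GET request asking for one RDF format."""
    return Request(BASE_URL + record_id, headers={"Accept": MEDIA_TYPES[fmt]})


requests_by_format = {
    fmt: build_request("pub.10.1186/1748-7188-5-21", fmt) for fmt in MEDIA_TYPES
}

for fmt, req in requests_by_format.items():
    print(fmt, req.full_url, req.get_header("Accept"))
```

Passing one of these objects to `urllib.request.urlopen` would perform the actual download.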


 

This table displays all metadata directly associated with this object as RDF triples.

161 TRIPLES      22 PREDICATES      84 URIs      72 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/1748-7188-5-21 schema:about anzsrc-for:08
2 anzsrc-for:0806
3 schema:author Nf94ba4dd7cdd45fca4a9c163475a8b30
4 schema:citation sg:pub.10.1007/bf01200757
5 sg:pub.10.1007/bf02257378
6 sg:pub.10.1007/bf02289565
7 sg:pub.10.1007/bf02603120
8 schema:datePublished 2010-05-14
9 schema:datePublishedReg 2010-05-14
10 schema:description BackgroundThe most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.ResultsIn this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.ConclusionsWe show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
11 schema:genre article
12 schema:inLanguage en
13 schema:isAccessibleForFree true
14 schema:isPartOf N816a889361e64b368cc58b24483034e3
15 N83898675f1a54b8b9a99fb9a8c16a1eb
16 sg:journal.1036449
17 schema:keywords ConclusionsWe
18 N2
19 ResultsIn
20 alignment
21 alignment method
22 approach
23 barriers
24 calculations
25 class
26 clustering
27 code
28 complex objects
29 computation time
30 construction
31 distance
32 distance calculation
33 distance matrix
34 download
35 embedding method
36 fast construction
37 full distance matrix
38 guide tree
39 initial step
40 large multiple alignments
41 large number
42 matrix
43 memory
44 memory requirements
45 method
46 most sequences
47 multiple alignment
48 multiple sequence alignment
49 multiple sequence alignment methods
50 number
51 objects
52 pair-wise distances
53 pairs
54 pairs of sequences
55 paper
56 quality
57 requirements
58 sequence
59 sequence alignment
60 sequence alignment methods
61 set
62 set of sequences
63 significant barriers
64 similarity
65 source code
66 space
67 step
68 time
69 trees
70 variation
71 schema:name Sequence embedding for fast construction of guide trees for multiple sequence alignment
72 schema:pagination 21
73 schema:productId N051ad54398da49f3b3f28908d6c1325b
74 N3a847e190fa64330ad1e9e1cab732532
75 Nf604bcddd90a4b5c9a936863e2c17f30
76 schema:sameAs https://app.dimensions.ai/details/publication/pub.1011273666
77 https://doi.org/10.1186/1748-7188-5-21
78 schema:sdDatePublished 2022-05-20T07:26
79 schema:sdLicense https://scigraph.springernature.com/explorer/license/
80 schema:sdPublisher N8771e613a5f04a57b86aea2d2331539f
81 schema:url https://doi.org/10.1186/1748-7188-5-21
82 sgo:license sg:explorer/license/
83 sgo:sdDataset articles
84 rdf:type schema:ScholarlyArticle
85 N051ad54398da49f3b3f28908d6c1325b schema:name doi
86 schema:value 10.1186/1748-7188-5-21
87 rdf:type schema:PropertyValue
88 N2331d9e80a714eb8807003418b06e544 rdf:first sg:person.01256311572.34
89 rdf:rest N26976a0a328c41ebb1bb524c84fa0bcb
90 N26976a0a328c41ebb1bb524c84fa0bcb rdf:first sg:person.0711212027.68
91 rdf:rest Nd5f45108fed249d990f604474e9636e4
92 N3a847e190fa64330ad1e9e1cab732532 schema:name dimensions_id
93 schema:value pub.1011273666
94 rdf:type schema:PropertyValue
95 N3d5acf87494b4b0c822bd1a5de0bc937 rdf:first sg:person.01065366335.84
96 rdf:rest rdf:nil
97 N816a889361e64b368cc58b24483034e3 schema:volumeNumber 5
98 rdf:type schema:PublicationVolume
99 N83898675f1a54b8b9a99fb9a8c16a1eb schema:issueNumber 1
100 rdf:type schema:PublicationIssue
101 N8771e613a5f04a57b86aea2d2331539f schema:name Springer Nature - SN SciGraph project
102 rdf:type schema:Organization
103 Nd5f45108fed249d990f604474e9636e4 rdf:first sg:person.0716312730.25
104 rdf:rest N3d5acf87494b4b0c822bd1a5de0bc937
105 Nf604bcddd90a4b5c9a936863e2c17f30 schema:name pubmed_id
106 schema:value 20470396
107 rdf:type schema:PropertyValue
108 Nf94ba4dd7cdd45fca4a9c163475a8b30 rdf:first sg:person.01175753635.10
109 rdf:rest N2331d9e80a714eb8807003418b06e544
110 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
111 schema:name Information and Computing Sciences
112 rdf:type schema:DefinedTerm
113 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
114 schema:name Information Systems
115 rdf:type schema:DefinedTerm
116 sg:grant.3982818 http://pending.schema.org/fundedItem sg:pub.10.1186/1748-7188-5-21
117 rdf:type schema:MonetaryGrant
118 sg:journal.1036449 schema:issn 1748-7188
119 schema:name Algorithms for Molecular Biology
120 schema:publisher Springer Nature
121 rdf:type schema:Periodical
122 sg:person.01065366335.84 schema:affiliation grid-institutes:grid.7886.1
123 schema:familyName Higgins
124 schema:givenName Desmond G
125 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01065366335.84
126 rdf:type schema:Person
127 sg:person.01175753635.10 schema:affiliation grid-institutes:grid.7886.1
128 schema:familyName Blackshields
129 schema:givenName Gordon
130 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01175753635.10
131 rdf:type schema:Person
132 sg:person.01256311572.34 schema:affiliation grid-institutes:grid.7886.1
133 schema:familyName Sievers
134 schema:givenName Fabian
135 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01256311572.34
136 rdf:type schema:Person
137 sg:person.0711212027.68 schema:affiliation grid-institutes:grid.7886.1
138 schema:familyName Shi
139 schema:givenName Weifeng
140 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0711212027.68
141 rdf:type schema:Person
142 sg:person.0716312730.25 schema:affiliation grid-institutes:grid.7886.1
143 schema:familyName Wilm
144 schema:givenName Andreas
145 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0716312730.25
146 rdf:type schema:Person
147 sg:pub.10.1007/bf01200757 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015898290
148 https://doi.org/10.1007/bf01200757
149 rdf:type schema:CreativeWork
150 sg:pub.10.1007/bf02257378 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014365591
151 https://doi.org/10.1007/bf02257378
152 rdf:type schema:CreativeWork
153 sg:pub.10.1007/bf02289565 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044215102
154 https://doi.org/10.1007/bf02289565
155 rdf:type schema:CreativeWork
156 sg:pub.10.1007/bf02603120 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022962956
157 https://doi.org/10.1007/bf02603120
158 rdf:type schema:CreativeWork
159 grid-institutes:grid.7886.1 schema:alternateName UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland
160 schema:name UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland
161 rdf:type schema:Organization
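In the table, the author order is stored as an RDF list: schema:author points at a blank node, and each blank node carries an rdf:first (the item) and an rdf:rest (the next node, or rdf:nil at the end). A minimal sketch of walking that chain follows, with the node identifiers copied from the rows above; the dictionary layout and the `walk_rdf_list` helper are assumptions for illustration.

```python
RDF_NIL = "rdf:nil"

# Blank node -> (rdf:first, rdf:rest), copied from the triples table.
list_nodes = {
    "Nf94ba4dd7cdd45fca4a9c163475a8b30": (
        "sg:person.01175753635.10", "N2331d9e80a714eb8807003418b06e544"),
    "N2331d9e80a714eb8807003418b06e544": (
        "sg:person.01256311572.34", "N26976a0a328c41ebb1bb524c84fa0bcb"),
    "N26976a0a328c41ebb1bb524c84fa0bcb": (
        "sg:person.0711212027.68", "Nd5f45108fed249d990f604474e9636e4"),
    "Nd5f45108fed249d990f604474e9636e4": (
        "sg:person.0716312730.25", "N3d5acf87494b4b0c822bd1a5de0bc937"),
    "N3d5acf87494b4b0c822bd1a5de0bc937": (
        "sg:person.01065366335.84", RDF_NIL),
}

# schema:familyName values from the person rows of the table.
family_names = {
    "sg:person.01175753635.10": "Blackshields",
    "sg:person.01256311572.34": "Sievers",
    "sg:person.0711212027.68": "Shi",
    "sg:person.0716312730.25": "Wilm",
    "sg:person.01065366335.84": "Higgins",
}


def walk_rdf_list(head, nodes):
    """Follow rdf:rest links from head, collecting each rdf:first item."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError("cycle in RDF list at " + node)
        seen.add(node)
        first, rest = nodes[node]
        items.append(first)
        node = rest
    return items


# The head node is the object of the schema:author triple (row 3).
author_ids = walk_rdf_list("Nf94ba4dd7cdd45fca4a9c163475a8b30", list_nodes)
authors_in_order = [family_names[a] for a in author_ids]
print(authors_in_order)
```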
 



