Kalign – an accurate and fast multiple sequence alignment algorithm


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2005-12-12

AUTHORS

Timo Lassmann, Erik LL Sonnhammer

ABSTRACT

BACKGROUND: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.

RESULTS: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.

CONCLUSION: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.

PAGES

298-298

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/1471-2105-6-298

DOI

http://dx.doi.org/10.1186/1471-2105-6-298

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1047546418

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/16343337



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Algorithms", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Nucleic Acid", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Protein", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genome", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genomics", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Reproducibility of Results", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Sequence Alignment", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Sequence Analysis, DNA", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Sequence Analysis, Protein", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden", 
          "id": "http://www.grid.ac/institutes/grid.4714.6", 
          "name": [
            "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Lassmann", 
        "givenName": "Timo", 
        "id": "sg:person.01161270315.84", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01161270315.84"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden", 
          "id": "http://www.grid.ac/institutes/grid.4714.6", 
          "name": [
            "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Sonnhammer", 
        "givenName": "Erik LL", 
        "id": "sg:person.01215262030.04", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01215262030.04"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf02603120", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1022962956", 
          "https://doi.org/10.1007/bf02603120"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2005-12-12", 
    "datePublishedReg": "2005-12-12", 
    "description": "BACKGROUND: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.\nRESULTS: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.\nCONCLUSION: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.", 
    "genre": "article", 
    "id": "sg:pub.10.1186/1471-2105-6-298", 
    "inLanguage": "en", 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1023786", 
        "issn": [
          "1471-2105"
        ], 
        "name": "BMC Bioinformatics", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "6"
      }
    ], 
    "keywords": [
      "multiple sequence alignment algorithm", 
      "string-matching algorithm", 
      "robust alignment method", 
      "multiple sequence alignment program", 
      "sequence alignment algorithm", 
      "multiple protein sequences", 
      "sequence alignment programs", 
      "large test set", 
      "alignment algorithm", 
      "set of sequences", 
      "Kalign", 
      "alignment size", 
      "important task", 
      "popular iterative method", 
      "alignment programs", 
      "multiple sequence alignment", 
      "alignment method", 
      "test set", 
      "algorithm", 
      "biological data", 
      "MSA methods", 
      "popular method", 
      "fundamental step", 
      "small alignments", 
      "iterative method", 
      "sequence alignment", 
      "large number", 
      "accuracy", 
      "BAliBASE", 
      "set", 
      "large-scale comparative genomics", 
      "PREFAB", 
      "task", 
      "ClustalW", 
      "speed", 
      "alignment", 
      "method", 
      "protein sequences", 
      "sequence", 
      "demand", 
      "time", 
      "step", 
      "availability", 
      "data", 
      "program", 
      "number", 
      "size", 
      "analysis", 
      "comparative genomics", 
      "comparison", 
      "genomics", 
      "structural properties", 
      "properties", 
      "genome sequence", 
      "family", 
      "protein family", 
      "sensitivity", 
      "motif", 
      "complete genome sequence", 
      "homology", 
      "Current MSA methods", 
      "Wu-Manber string-matching algorithm", 
      "accuracy of Kalign", 
      "new large test set", 
      "fast multiple sequence alignment algorithm"
    ], 
    "name": "Kalign \u2013 an accurate and fast multiple sequence alignment algorithm", 
    "pagination": "298-298", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1047546418"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/1471-2105-6-298"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "16343337"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/1471-2105-6-298", 
      "https://app.dimensions.ai/details/publication/pub.1047546418"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2021-12-01T19:16", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20211201/entities/gbq_results/article/article_400.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/1471-2105-6-298"
  }
]
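
Because JSON-LD is plain JSON, the record above can be read with any JSON parser. The following is a minimal Python sketch, not part of SciGraph itself: it embeds a trimmed excerpt of the record (only the `id` and `productId` fields) and collects the identifiers into a dictionary. The helper name `extract_identifiers` is an illustrative assumption.

```python
import json

# Trimmed excerpt of the record shown above (only the fields used here).
RECORD_JSON = """
[
  {
    "id": "sg:pub.10.1186/1471-2105-6-298",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1047546418"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/1471-2105-6-298"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["16343337"]}
    ]
  }
]
"""


def extract_identifiers(record_json: str) -> dict:
    """Map each productId name to its first value (hypothetical helper)."""
    records = json.loads(record_json)  # the top level is a one-element list
    article = records[0]
    return {
        entry["name"]: entry["value"][0]
        for entry in article.get("productId", [])
    }


identifiers = extract_identifiers(RECORD_JSON)
print(identifiers["doi"])  # 10.1186/1471-2105-6-298
```

Each `productId` value is itself a list in this record, so the sketch takes the first element; a record with several values per identifier would need different handling.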
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'
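
The four curl commands above differ only in their `Accept` header. As a rough Python equivalent, the sketch below uses the standard library's `urllib.request` to build the same requests. The names `MEDIA_TYPES`, `build_request`, and `fetch` are illustrative assumptions, as are the 10-second timeout and the UTF-8 decoding of the response; whether the endpoint still answers is not something this sketch can guarantee.

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298"

# Media types taken from the curl commands above; the keys are our own labels.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt: str, url: str = RECORD_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request asking for one RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(fmt: str, url: str = RECORD_URL, timeout: float = 10.0) -> str:
    """Send the request and return the body as text (assumes UTF-8)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


request = build_request("turtle")
print(request.get_header("Accept"))  # text/turtle
```

Calling `fetch("n-triples")` would perform the network request; it is kept separate from `build_request` so the header logic can be checked offline.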


 

This table displays all metadata directly associated with this object as RDF triples.

173 TRIPLES      22 PREDICATES      101 URIs      92 LITERALS      16 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/1471-2105-6-298 schema:about N1bfc88f43e2648ba8ce0102e41c8dba9
2 N31ab0013aeae4666be1ffe733effbf32
3 N4b229869cbe040bfb70ab4746d220d60
4 N52a6c6bc902740b69402ab3514e20cf6
5 N5c29668dddea4fdfa2976a07119f1d51
6 Na8af0d1476994e9ebfc4ab59f776a13e
7 Nfc544117065b4b59874f8c17bb5a5302
8 Nfdb8af9385ec436690e799a2f0d0dd0b
9 Nfde80511356a42efb71d7ca8d138878d
10 anzsrc-for:08
11 anzsrc-for:0806
12 schema:author N04be160010dc4ab2b55d71cbf85d141d
13 schema:citation sg:pub.10.1007/bf02603120
14 schema:datePublished 2005-12-12
15 schema:datePublishedReg 2005-12-12
16 schema:description BACKGROUND: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics. RESULTS: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods. CONCLUSION: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.
17 schema:genre article
18 schema:inLanguage en
19 schema:isAccessibleForFree true
20 schema:isPartOf N7a6ef3884483437b9632a7b435a4f357
21 Nb89b743d1bb84a31b015317522d26eb0
22 sg:journal.1023786
23 schema:keywords BAliBASE
24 ClustalW
25 Current MSA methods
26 Kalign
27 MSA methods
28 PREFAB
29 Wu-Manber string-matching algorithm
30 accuracy
31 accuracy of Kalign
32 algorithm
33 alignment
34 alignment algorithm
35 alignment method
36 alignment programs
37 alignment size
38 analysis
39 availability
40 biological data
41 comparative genomics
42 comparison
43 complete genome sequence
44 data
45 demand
46 family
47 fast multiple sequence alignment algorithm
48 fundamental step
49 genome sequence
50 genomics
51 homology
52 important task
53 iterative method
54 large number
55 large test set
56 large-scale comparative genomics
57 method
58 motif
59 multiple protein sequences
60 multiple sequence alignment
61 multiple sequence alignment algorithm
62 multiple sequence alignment program
63 new large test set
64 number
65 popular iterative method
66 popular method
67 program
68 properties
69 protein family
70 protein sequences
71 robust alignment method
72 sensitivity
73 sequence
74 sequence alignment
75 sequence alignment algorithm
76 sequence alignment programs
77 set
78 set of sequences
79 size
80 small alignments
81 speed
82 step
83 string-matching algorithm
84 structural properties
85 task
86 test set
87 time
88 schema:name Kalign – an accurate and fast multiple sequence alignment algorithm
89 schema:pagination 298-298
90 schema:productId N2008b46ac2504e7cb3ead4e453a3dd17
91 N4c54cc18172a44b8bf8fecedba885ffb
92 N950ee95814694109bb9ef1b7dc9ef3bc
93 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047546418
94 https://doi.org/10.1186/1471-2105-6-298
95 schema:sdDatePublished 2021-12-01T19:16
96 schema:sdLicense https://scigraph.springernature.com/explorer/license/
97 schema:sdPublisher N4b0548b51ea1464bb01cbcb0c50ddde2
98 schema:url https://doi.org/10.1186/1471-2105-6-298
99 sgo:license sg:explorer/license/
100 sgo:sdDataset articles
101 rdf:type schema:ScholarlyArticle
102 N04be160010dc4ab2b55d71cbf85d141d rdf:first sg:person.01161270315.84
103 rdf:rest N5508131e26084270a24046ca11ac0105
104 N1bfc88f43e2648ba8ce0102e41c8dba9 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
105 schema:name Databases, Protein
106 rdf:type schema:DefinedTerm
107 N2008b46ac2504e7cb3ead4e453a3dd17 schema:name dimensions_id
108 schema:value pub.1047546418
109 rdf:type schema:PropertyValue
110 N31ab0013aeae4666be1ffe733effbf32 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
111 schema:name Genome
112 rdf:type schema:DefinedTerm
113 N4b0548b51ea1464bb01cbcb0c50ddde2 schema:name Springer Nature - SN SciGraph project
114 rdf:type schema:Organization
115 N4b229869cbe040bfb70ab4746d220d60 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
116 schema:name Databases, Nucleic Acid
117 rdf:type schema:DefinedTerm
118 N4c54cc18172a44b8bf8fecedba885ffb schema:name doi
119 schema:value 10.1186/1471-2105-6-298
120 rdf:type schema:PropertyValue
121 N52a6c6bc902740b69402ab3514e20cf6 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
122 schema:name Sequence Analysis, DNA
123 rdf:type schema:DefinedTerm
124 N5508131e26084270a24046ca11ac0105 rdf:first sg:person.01215262030.04
125 rdf:rest rdf:nil
126 N5c29668dddea4fdfa2976a07119f1d51 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
127 schema:name Sequence Alignment
128 rdf:type schema:DefinedTerm
129 N7a6ef3884483437b9632a7b435a4f357 schema:issueNumber 1
130 rdf:type schema:PublicationIssue
131 N950ee95814694109bb9ef1b7dc9ef3bc schema:name pubmed_id
132 schema:value 16343337
133 rdf:type schema:PropertyValue
134 Na8af0d1476994e9ebfc4ab59f776a13e schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
135 schema:name Algorithms
136 rdf:type schema:DefinedTerm
137 Nb89b743d1bb84a31b015317522d26eb0 schema:volumeNumber 6
138 rdf:type schema:PublicationVolume
139 Nfc544117065b4b59874f8c17bb5a5302 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
140 schema:name Sequence Analysis, Protein
141 rdf:type schema:DefinedTerm
142 Nfdb8af9385ec436690e799a2f0d0dd0b schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
143 schema:name Genomics
144 rdf:type schema:DefinedTerm
145 Nfde80511356a42efb71d7ca8d138878d schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
146 schema:name Reproducibility of Results
147 rdf:type schema:DefinedTerm
148 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
149 schema:name Information and Computing Sciences
150 rdf:type schema:DefinedTerm
151 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
152 schema:name Information Systems
153 rdf:type schema:DefinedTerm
154 sg:journal.1023786 schema:issn 1471-2105
155 schema:name BMC Bioinformatics
156 schema:publisher Springer Nature
157 rdf:type schema:Periodical
158 sg:person.01161270315.84 schema:affiliation grid-institutes:grid.4714.6
159 schema:familyName Lassmann
160 schema:givenName Timo
161 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01161270315.84
162 rdf:type schema:Person
163 sg:person.01215262030.04 schema:affiliation grid-institutes:grid.4714.6
164 schema:familyName Sonnhammer
165 schema:givenName Erik LL
166 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01215262030.04
167 rdf:type schema:Person
168 sg:pub.10.1007/bf02603120 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022962956
169 https://doi.org/10.1007/bf02603120
170 rdf:type schema:CreativeWork
171 grid-institutes:grid.4714.6 schema:alternateName Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden
172 schema:name Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden
173 rdf:type schema:Organization
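
Summary counts like those shown above the table can be approximated from an N-Triples download. The sketch below is a simplified illustration, not the tool SciGraph uses: it splits each line into subject, predicate, and object and counts distinct terms by kind. The four sample triples are hand-written from rows of the table; the `http://schema.org/` expansion of the `schema:` prefix and the blank node label `_:b0` are assumptions, and the page's own counting convention may differ.

```python
# Hand-written sample; prefix expansion and blank node label are assumptions.
SAMPLE_NT = """\
<http://scigraph.springernature.com/pub.10.1186/1471-2105-6-298> <http://schema.org/genre> "article" .
<http://scigraph.springernature.com/pub.10.1186/1471-2105-6-298> <http://schema.org/inLanguage> "en" .
<http://scigraph.springernature.com/pub.10.1186/1471-2105-6-298> <http://schema.org/isPartOf> _:b0 .
_:b0 <http://schema.org/issueNumber> "1" .
"""


def summarize(nt_text: str) -> dict:
    """Count triples and distinct predicates, URIs, literals, and blank nodes."""
    triples = []
    for line in nt_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Drop the terminating "." and split into at most three fields,
        # so a literal containing spaces stays in one piece.
        subject, predicate, obj = line.rstrip(".").rstrip().split(None, 2)
        triples.append((subject, predicate, obj))

    terms = {term for triple in triples for term in triple}
    return {
        "triples": len(triples),
        "predicates": len({p for _, p, _ in triples}),
        "uris": len({t for t in terms if t.startswith("<")}),
        "literals": len({t for t in terms if t.startswith('"')}),
        "blank_nodes": len({t for t in terms if t.startswith("_:")}),
    }


summary = summarize(SAMPLE_NT)
print(summary)
```

A full N-Triples parser would also handle escapes, language tags, and datatypes; a dedicated RDF library is the safer choice for anything beyond a quick tally.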
 



