Kalign – an accurate and fast multiple sequence alignment algorithm


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2005-12-12

AUTHORS

Timo Lassmann, Erik LL Sonnhammer

ABSTRACT

BACKGROUND: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.

RESULTS: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.

CONCLUSION: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.

PAGES

298-298

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/1471-2105-6-298

DOI

http://dx.doi.org/10.1186/1471-2105-6-298

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1047546418

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/16343337



JSON-LD is the canonical representation for SciGraph data.

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Algorithms", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Nucleic Acid", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Protein", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genome", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genomics", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Reproducibility of Results", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Sequence Alignment", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Sequence Analysis, DNA", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Sequence Analysis, Protein", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden", 
          "id": "http://www.grid.ac/institutes/grid.4714.6", 
          "name": [
            "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Lassmann", 
        "givenName": "Timo", 
        "id": "sg:person.01161270315.84", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01161270315.84"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden", 
          "id": "http://www.grid.ac/institutes/grid.4714.6", 
          "name": [
            "Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Sonnhammer", 
        "givenName": "Erik LL", 
        "id": "sg:person.01215262030.04", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01215262030.04"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf02603120", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1022962956", 
          "https://doi.org/10.1007/bf02603120"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2005-12-12", 
    "datePublishedReg": "2005-12-12", 
    "description": "BACKGROUND: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.\nRESULTS: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.\nCONCLUSION: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.", 
    "genre": "article", 
    "id": "sg:pub.10.1186/1471-2105-6-298", 
    "inLanguage": "en", 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1023786", 
        "issn": [
          "1471-2105"
        ], 
        "name": "BMC Bioinformatics", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "6"
      }
    ], 
    "keywords": [
      "multiple sequence alignment algorithm", 
      "string-matching algorithm", 
      "robust alignment method", 
      "multiple sequence alignment program", 
      "sequence alignment algorithm", 
      "multiple protein sequences", 
      "sequence alignment programs", 
      "large test set", 
      "alignment algorithm", 
      "set of sequences", 
      "Kalign", 
      "alignment size", 
      "important task", 
      "popular iterative method", 
      "alignment programs", 
      "multiple sequence alignment", 
      "alignment method", 
      "test set", 
      "algorithm", 
      "biological data", 
      "MSA methods", 
      "popular method", 
      "fundamental step", 
      "small alignments", 
      "iterative method", 
      "sequence alignment", 
      "large number", 
      "accuracy", 
      "BAliBASE", 
      "set", 
      "large-scale comparative genomics", 
      "PREFAB", 
      "task", 
      "ClustalW", 
      "speed", 
      "alignment", 
      "method", 
      "protein sequences", 
      "sequence", 
      "demand", 
      "time", 
      "step", 
      "availability", 
      "data", 
      "program", 
      "number", 
      "size", 
      "analysis", 
      "comparative genomics", 
      "comparison", 
      "genomics", 
      "structural properties", 
      "properties", 
      "genome sequence", 
      "family", 
      "protein family", 
      "sensitivity", 
      "motif", 
      "complete genome sequence", 
      "homology", 
      "Current MSA methods", 
      "Wu-Manber string-matching algorithm", 
      "accuracy of Kalign", 
      "new large test set", 
      "fast multiple sequence alignment algorithm"
    ], 
    "name": "Kalign \u2013 an accurate and fast multiple sequence alignment algorithm", 
    "pagination": "298-298", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1047546418"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/1471-2105-6-298"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "16343337"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/1471-2105-6-298", 
      "https://app.dimensions.ai/details/publication/pub.1047546418"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2021-12-01T19:16", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20211201/entities/gbq_results/article/article_400.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/1471-2105-6-298"
  }
]
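As a minimal sketch of how a record like the one above can be read, the following Python parses a trimmed copy of the JSON-LD and pulls out the title, authors, and identifiers. The function name `summarize` and the constant `RECORD_JSON` are illustrative and not part of SciGraph; only field names that appear in the record are used.

```python
import json

# Trimmed copy of the record above; only the fields read below are kept.
RECORD_JSON = r"""
[
  {
    "id": "sg:pub.10.1186/1471-2105-6-298",
    "name": "Kalign \u2013 an accurate and fast multiple sequence alignment algorithm",
    "datePublished": "2005-12-12",
    "author": [
      {"familyName": "Lassmann", "givenName": "Timo"},
      {"familyName": "Sonnhammer", "givenName": "Erik LL"}
    ],
    "productId": [
      {"name": "dimensions_id", "value": ["pub.1047546418"]},
      {"name": "doi", "value": ["10.1186/1471-2105-6-298"]},
      {"name": "pubmed_id", "value": ["16343337"]}
    ]
  }
]
"""


def summarize(record_json: str) -> dict:
    """Return a flat summary of the first record in a JSON-LD array (hypothetical helper)."""
    record = json.loads(record_json)[0]  # the payload is a one-element array
    # Each productId entry holds a name and a list of values; keep the first value.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    authors = [f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])]
    return {
        "title": record["name"],
        "date": record["datePublished"],
        "authors": authors,
        "doi": ids.get("doi"),
        "pubmed_id": ids.get("pubmed_id"),
    }


summary = summarize(RECORD_JSON)
```

Because JSON-LD is plain JSON, the standard `json` module is enough here; a dedicated JSON-LD library would only be needed to expand the `@context` into full URIs.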
 

HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1471-2105-6-298'
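The four curl commands above differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The helper names `build_request` and `fetch` and the short format keys are assumptions made for illustration; the URL pattern and MIME types are taken from the commands above, and the endpoint may not be reachable.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/pub."

# Short keys (illustrative) mapped to the MIME types used in the curl commands.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(doi: str, fmt: str = "json-ld") -> urllib.request.Request:
    """Build a GET request asking for one RDF serialization of a record."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return urllib.request.Request(BASE_URL + doi, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(doi: str, fmt: str = "json-ld", timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(doi, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch("10.1186/1471-2105-6-298", "turtle")` would request the Turtle serialization of this record, provided the service responds.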


 

This table displays all metadata directly associated with this object as RDF triples.

173 TRIPLES      22 PREDICATES      101 URIs      92 LITERALS      16 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/1471-2105-6-298 schema:about N02bde8dd442f4b04be20bec03c594c1e
2 N14b8feb3d363495495c610a322f7d47c
3 N24caa93c11924968a21a08ae3e2687ca
4 N4d9eeb695ae442999dc58eb05340255c
5 N563b43183ec44b3c9459334a4f2bfc21
6 N6db2c1a8d08a4b0491b24b3e12f06a6f
7 N813fda6835104b8eb9b250ab3d5b055b
8 N9e90459623d242158e99ae260c6c912f
9 Nb831a00857b84132af9e7c723edc0ba5
10 anzsrc-for:08
11 anzsrc-for:0806
12 schema:author N6b2ce12a92014b9cb9afc5f8bb156541
13 schema:citation sg:pub.10.1007/bf02603120
14 schema:datePublished 2005-12-12
15 schema:datePublishedReg 2005-12-12
16 schema:description BACKGROUND: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics. RESULTS: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods. CONCLUSION: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.
17 schema:genre article
18 schema:inLanguage en
19 schema:isAccessibleForFree true
20 schema:isPartOf N1df53a7a1c7c4dd18bf08c80cd9385a4
21 Ne8c28f6b3d7b40eabf78886f11907197
22 sg:journal.1023786
23 schema:keywords BAliBASE
24 ClustalW
25 Current MSA methods
26 Kalign
27 MSA methods
28 PREFAB
29 Wu-Manber string-matching algorithm
30 accuracy
31 accuracy of Kalign
32 algorithm
33 alignment
34 alignment algorithm
35 alignment method
36 alignment programs
37 alignment size
38 analysis
39 availability
40 biological data
41 comparative genomics
42 comparison
43 complete genome sequence
44 data
45 demand
46 family
47 fast multiple sequence alignment algorithm
48 fundamental step
49 genome sequence
50 genomics
51 homology
52 important task
53 iterative method
54 large number
55 large test set
56 large-scale comparative genomics
57 method
58 motif
59 multiple protein sequences
60 multiple sequence alignment
61 multiple sequence alignment algorithm
62 multiple sequence alignment program
63 new large test set
64 number
65 popular iterative method
66 popular method
67 program
68 properties
69 protein family
70 protein sequences
71 robust alignment method
72 sensitivity
73 sequence
74 sequence alignment
75 sequence alignment algorithm
76 sequence alignment programs
77 set
78 set of sequences
79 size
80 small alignments
81 speed
82 step
83 string-matching algorithm
84 structural properties
85 task
86 test set
87 time
88 schema:name Kalign – an accurate and fast multiple sequence alignment algorithm
89 schema:pagination 298-298
90 schema:productId N179546f9893449d9b1a0b7ece32de418
91 N4ff14080eeee4acc8c3b8f5f728d0b6d
92 Nfde9a52098ce4035b4142ea8f5675933
93 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047546418
94 https://doi.org/10.1186/1471-2105-6-298
95 schema:sdDatePublished 2021-12-01T19:16
96 schema:sdLicense https://scigraph.springernature.com/explorer/license/
97 schema:sdPublisher N841e06e1f3074114bbb777e8673b60eb
98 schema:url https://doi.org/10.1186/1471-2105-6-298
99 sgo:license sg:explorer/license/
100 sgo:sdDataset articles
101 rdf:type schema:ScholarlyArticle
102 N02bde8dd442f4b04be20bec03c594c1e schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
103 schema:name Databases, Protein
104 rdf:type schema:DefinedTerm
105 N0de24b2ff77f44d4b5a959cb60fc6a86 rdf:first sg:person.01215262030.04
106 rdf:rest rdf:nil
107 N14b8feb3d363495495c610a322f7d47c schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
108 schema:name Genomics
109 rdf:type schema:DefinedTerm
110 N179546f9893449d9b1a0b7ece32de418 schema:name doi
111 schema:value 10.1186/1471-2105-6-298
112 rdf:type schema:PropertyValue
113 N1df53a7a1c7c4dd18bf08c80cd9385a4 schema:volumeNumber 6
114 rdf:type schema:PublicationVolume
115 N24caa93c11924968a21a08ae3e2687ca schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
116 schema:name Sequence Alignment
117 rdf:type schema:DefinedTerm
118 N4d9eeb695ae442999dc58eb05340255c schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
119 schema:name Databases, Nucleic Acid
120 rdf:type schema:DefinedTerm
121 N4ff14080eeee4acc8c3b8f5f728d0b6d schema:name pubmed_id
122 schema:value 16343337
123 rdf:type schema:PropertyValue
124 N563b43183ec44b3c9459334a4f2bfc21 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
125 schema:name Sequence Analysis, DNA
126 rdf:type schema:DefinedTerm
127 N6b2ce12a92014b9cb9afc5f8bb156541 rdf:first sg:person.01161270315.84
128 rdf:rest N0de24b2ff77f44d4b5a959cb60fc6a86
129 N6db2c1a8d08a4b0491b24b3e12f06a6f schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
130 schema:name Genome
131 rdf:type schema:DefinedTerm
132 N813fda6835104b8eb9b250ab3d5b055b schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
133 schema:name Sequence Analysis, Protein
134 rdf:type schema:DefinedTerm
135 N841e06e1f3074114bbb777e8673b60eb schema:name Springer Nature - SN SciGraph project
136 rdf:type schema:Organization
137 N9e90459623d242158e99ae260c6c912f schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
138 schema:name Algorithms
139 rdf:type schema:DefinedTerm
140 Nb831a00857b84132af9e7c723edc0ba5 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
141 schema:name Reproducibility of Results
142 rdf:type schema:DefinedTerm
143 Ne8c28f6b3d7b40eabf78886f11907197 schema:issueNumber 1
144 rdf:type schema:PublicationIssue
145 Nfde9a52098ce4035b4142ea8f5675933 schema:name dimensions_id
146 schema:value pub.1047546418
147 rdf:type schema:PropertyValue
148 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
149 schema:name Information and Computing Sciences
150 rdf:type schema:DefinedTerm
151 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
152 schema:name Information Systems
153 rdf:type schema:DefinedTerm
154 sg:journal.1023786 schema:issn 1471-2105
155 schema:name BMC Bioinformatics
156 schema:publisher Springer Nature
157 rdf:type schema:Periodical
158 sg:person.01161270315.84 schema:affiliation grid-institutes:grid.4714.6
159 schema:familyName Lassmann
160 schema:givenName Timo
161 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01161270315.84
162 rdf:type schema:Person
163 sg:person.01215262030.04 schema:affiliation grid-institutes:grid.4714.6
164 schema:familyName Sonnhammer
165 schema:givenName Erik LL
166 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01215262030.04
167 rdf:type schema:Person
168 sg:pub.10.1007/bf02603120 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022962956
169 https://doi.org/10.1007/bf02603120
170 rdf:type schema:CreativeWork
171 grid-institutes:grid.4714.6 schema:alternateName Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden
172 schema:name Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius vag 35, S-17177 Stockholm, Sweden
173 rdf:type schema:Organization
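In the table above, the ordered author list is stored as an RDF collection: the `schema:author` object is a blank node, and each node carries an `rdf:first` (the item) and an `rdf:rest` (the next node, or `rdf:nil` at the end). A minimal sketch of walking that chain, using plain tuples copied from the table rather than an RDF library (the function name `rdf_list` is illustrative):

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# (subject, predicate, object) rows copied from the author-related triples above.
TRIPLES = [
    ("sg:pub.10.1186/1471-2105-6-298", "schema:author", "N6b2ce12a92014b9cb9afc5f8bb156541"),
    ("N6b2ce12a92014b9cb9afc5f8bb156541", RDF_FIRST, "sg:person.01161270315.84"),
    ("N6b2ce12a92014b9cb9afc5f8bb156541", RDF_REST, "N0de24b2ff77f44d4b5a959cb60fc6a86"),
    ("N0de24b2ff77f44d4b5a959cb60fc6a86", RDF_FIRST, "sg:person.01215262030.04"),
    ("N0de24b2ff77f44d4b5a959cb60fc6a86", RDF_REST, RDF_NIL),
]


def rdf_list(triples, head):
    """Return the items of the RDF collection starting at blank node `head`, in order."""
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    items, node, seen = [], head, set()
    # Follow rdf:rest until rdf:nil; `seen` guards against a malformed cyclic list.
    while node != RDF_NIL and node not in seen:
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items


author_head = next(o for s, p, o in TRIPLES if p == "schema:author")
authors = rdf_list(TRIPLES, author_head)
```

Here `authors` holds the two person identifiers in the order given by the list, which matches the author order shown in the JSON-LD record.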
 



