Ontology type: schema:ScholarlyArticle
Open Access: True
2010-05-14
AUTHORS: Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins
ABSTRACT
Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
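The idea in the abstract, replacing the full N × N distance matrix with distances to a small set of reference sequences, can be illustrated with a minimal Python sketch. This is not the authors' mBed code: the k-mer distance, the random choice of references, and the names kmer_distance, embed, euclidean and full_matrix_pairs are all assumptions made for illustration only.

```python
import random
from collections import Counter
from math import sqrt


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """Illustrative distance (an assumption, not the paper's measure):
    1 - (shared k-mers / k-mer count of the shorter sequence)."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())          # multiset intersection
    smaller = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / smaller if smaller else 1.0


def embed(seqs, n_refs, seed=0):
    """Map each sequence to a vector of distances to n_refs references.

    References are picked at random here (hypothetical choice); only
    len(seqs) * n_refs sequence distances are ever computed.
    """
    rng = random.Random(seed)
    refs = rng.sample(seqs, n_refs)
    return [[kmer_distance(s, r) for r in refs] for s in seqs]


def euclidean(u, v):
    """Cheap vector distance used in place of a sequence comparison."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def full_matrix_pairs(n: int) -> int:
    """Number of distinct pairs in a full distance matrix."""
    return n * (n - 1) // 2


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGA"]
vectors = embed(seqs, n_refs=2)
# Similar sequences should land close together in the embedded space.
close = euclidean(vectors[0], vectors[1])
far = euclidean(vectors[0], vectors[2])
```

For N = 10,000, full_matrix_pairs gives 49,995,000 sequence comparisons, whereas a hypothetical 100 references would need 1,000,000. Any clustering method can then run on the vectors instead of the sequences.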
PAGES: 21
http://scigraph.springernature.com/pub.10.1186/1748-7188-5-21
DOI: http://dx.doi.org/10.1186/1748-7188-5-21
DIMENSIONS: https://app.dimensions.ai/details/publication/pub.1011273666
PUBMED: https://www.ncbi.nlm.nih.gov/pubmed/20470396
JSON-LD is the canonical representation for SciGraph data.
[
{
"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json",
"about": [
{
"id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08",
"inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/",
"name": "Information and Computing Sciences",
"type": "DefinedTerm"
},
{
"id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806",
"inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/",
"name": "Information Systems",
"type": "DefinedTerm"
}
],
"author": [
{
"affiliation": {
"alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland",
"id": "http://www.grid.ac/institutes/grid.7886.1",
"name": [
"UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
],
"type": "Organization"
},
"familyName": "Blackshields",
"givenName": "Gordon",
"id": "sg:person.01175753635.10",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01175753635.10"
],
"type": "Person"
},
{
"affiliation": {
"alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland",
"id": "http://www.grid.ac/institutes/grid.7886.1",
"name": [
"UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
],
"type": "Organization"
},
"familyName": "Sievers",
"givenName": "Fabian",
"id": "sg:person.01256311572.34",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01256311572.34"
],
"type": "Person"
},
{
"affiliation": {
"alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland",
"id": "http://www.grid.ac/institutes/grid.7886.1",
"name": [
"UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
],
"type": "Organization"
},
"familyName": "Shi",
"givenName": "Weifeng",
"id": "sg:person.0711212027.68",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0711212027.68"
],
"type": "Person"
},
{
"affiliation": {
"alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland",
"id": "http://www.grid.ac/institutes/grid.7886.1",
"name": [
"UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
],
"type": "Organization"
},
"familyName": "Wilm",
"givenName": "Andreas",
"id": "sg:person.0716312730.25",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0716312730.25"
],
"type": "Person"
},
{
"affiliation": {
"alternateName": "UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland",
"id": "http://www.grid.ac/institutes/grid.7886.1",
"name": [
"UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland"
],
"type": "Organization"
},
"familyName": "Higgins",
"givenName": "Desmond G",
"id": "sg:person.01065366335.84",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01065366335.84"
],
"type": "Person"
}
],
"citation": [
{
"id": "sg:pub.10.1007/bf02603120",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1022962956",
"https://doi.org/10.1007/bf02603120"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1007/bf02289565",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1044215102",
"https://doi.org/10.1007/bf02289565"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1007/bf02257378",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1014365591",
"https://doi.org/10.1007/bf02257378"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1007/bf01200757",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1015898290",
"https://doi.org/10.1007/bf01200757"
],
"type": "CreativeWork"
}
],
"datePublished": "2010-05-14",
"datePublishedReg": "2010-05-14",
"description": "BackgroundThe most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.ResultsIn this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.ConclusionsWe show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.",
"genre": "article",
"id": "sg:pub.10.1186/1748-7188-5-21",
"inLanguage": "en",
"isAccessibleForFree": true,
"isFundedItemOf": [
{
"id": "sg:grant.3982818",
"type": "MonetaryGrant"
}
],
"isPartOf": [
{
"id": "sg:journal.1036449",
"issn": [
"1748-7188"
],
"name": "Algorithms for Molecular Biology",
"publisher": "Springer Nature",
"type": "Periodical"
},
{
"issueNumber": "1",
"type": "PublicationIssue"
},
{
"type": "PublicationVolume",
"volumeNumber": "5"
}
],
"keywords": [
"full distance matrix",
"guide tree",
"large multiple alignments",
"sequence alignment methods",
"multiple sequence alignment methods",
"source code",
"multiple alignment",
"pair-wise distances",
"memory requirements",
"set of sequences",
"pairs of sequences",
"computation time",
"complex objects",
"embedding method",
"distance calculation",
"multiple sequence alignment",
"alignment method",
"large number",
"fast construction",
"distance matrix",
"sequence alignment",
"download",
"clustering",
"trees",
"alignment",
"objects",
"code",
"requirements",
"method",
"set",
"memory",
"sequence",
"initial step",
"space",
"number",
"time",
"significant barriers",
"quality",
"step",
"similarity",
"construction",
"class",
"calculations",
"matrix",
"most sequences",
"distance",
"pairs",
"approach",
"variation",
"barriers",
"N2",
"ConclusionsWe",
"ResultsIn",
"paper"
],
"name": "Sequence embedding for fast construction of guide trees for multiple sequence alignment",
"pagination": "21",
"productId": [
{
"name": "dimensions_id",
"type": "PropertyValue",
"value": [
"pub.1011273666"
]
},
{
"name": "doi",
"type": "PropertyValue",
"value": [
"10.1186/1748-7188-5-21"
]
},
{
"name": "pubmed_id",
"type": "PropertyValue",
"value": [
"20470396"
]
}
],
"sameAs": [
"https://doi.org/10.1186/1748-7188-5-21",
"https://app.dimensions.ai/details/publication/pub.1011273666"
],
"sdDataset": "articles",
"sdDatePublished": "2022-05-20T07:26",
"sdLicense": "https://scigraph.springernature.com/explorer/license/",
"sdPublisher": {
"name": "Springer Nature - SN SciGraph project",
"type": "Organization"
},
"sdSource": "s3://com-springernature-scigraph/baseset/20220519/entities/gbq_results/article/article_520.jsonl",
"type": "ScholarlyArticle",
"url": "https://doi.org/10.1186/1748-7188-5-21"
}
]
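Because the record is plain JSON, it can be read with any JSON parser. The following Python sketch pulls the title, author names, and identifiers out of a trimmed copy of the record above; the variable record_text and the function summarize are hypothetical names, and the inline text keeps only a few fields and two of the five authors for brevity.

```python
import json

# Trimmed copy of the record above; field names are taken from it unchanged.
record_text = """
[
  {
    "author": [
      {"familyName": "Blackshields", "givenName": "Gordon"},
      {"familyName": "Higgins", "givenName": "Desmond G"}
    ],
    "name": "Sequence embedding for fast construction of guide trees for multiple sequence alignment",
    "productId": [
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/1748-7188-5-21"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["20470396"]}
    ]
  }
]
"""


def summarize(text: str) -> dict:
    """Return title, author names, and identifiers from a SciGraph-style record."""
    article = json.loads(text)[0]             # the record is a one-element list
    authors = [f'{a["givenName"]} {a["familyName"]}' for a in article["author"]]
    # Each productId entry holds its value in a one-element list.
    ids = {p["name"]: p["value"][0] for p in article["productId"]}
    return {"title": article["name"], "authors": authors, "ids": ids}


summary = summarize(record_text)
```

This treats the record as ordinary JSON and ignores the "@context"; resolving terms to full URIs would need a JSON-LD processor.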
The RDF metadata can be downloaded as JSON-LD, N-Triples, Turtle, or RDF/XML.
JSON-LD is a popular format for linked data which is fully compatible with JSON.
curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'
N-Triples is a line-based linked data format ideal for batch operations.
curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'
Turtle is a human-readable linked data format.
curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'
RDF/XML is a standard XML format for linked data.
curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21'
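The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The names ACCEPT, build_request and fetch are hypothetical, and the sketch assumes the endpoint still answers as described on this page; fetch is defined but not called here.

```python
from urllib.request import Request, urlopen

RECORD_URL = "https://scigraph.springernature.com/pub.10.1186/1748-7188-5-21"

# Media types copied from the curl commands above.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt: str, url: str = RECORD_URL) -> Request:
    """Build a GET request asking for the record in the given RDF format."""
    return Request(url, headers={"Accept": ACCEPT[fmt]})


def fetch(fmt: str, url: str = RECORD_URL, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text (needs network)."""
    with urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request("turtle")
```

Calling fetch("json-ld") would be the equivalent of the first curl command.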
This table displays all metadata directly associated with this object as RDF triples.
161 TRIPLES
22 PREDICATES
84 URIs
72 LITERALS
7 BLANK NODES