External memory BWT and LCP computation for sequence collections with applications View Full Text


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2019-12

AUTHORS

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, Guilherme P. Telles

ABSTRACT

Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs O ( n maxlcp ) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input. More... »

PAGES

6

References to SciGraph publications

  • 2004. Two Space Saving Tricks for Linear Time LCP Array Computation in ALGORITHM THEORY - SWAT 2004
  • 2017-09-06. Lightweight BWT and LCP Merging via the Gap Algorithm in STRING PROCESSING AND INFORMATION RETRIEVAL
  • 2017-12. Generalized enhanced suffix array construction in external memory in ALGORITHMS FOR MOLECULAR BIOLOGY
  • 2016. Bidirectional Variable-Order de Bruijn Graphs in LATIN 2016: THEORETICAL INFORMATICS
  • 2003. Fast Lightweight Suffix Array Construction and Checking in COMBINATORIAL PATTERN MATCHING
  • 2012. Succinct de Bruijn Graphs in ALGORITHMS IN BIOINFORMATICS
  • 2010. Lightweight Data Indexing and Compression in External Memory in LATIN 2010: THEORETICAL INFORMATICS
  • 2010. Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes in STRING PROCESSING AND INFORMATION RETRIEVAL
  • 2002. Engineering a Lightweight Suffix Array Construction Algorithm in ALGORITHMS — ESA 2002
  • 2017-06. An External-Memory Algorithm for String Graph Construction in ALGORITHMICA
  • 2017-06. Engineering a Lightweight External Memory Suffix Array Construction Algorithm in MATHEMATICS IN COMPUTER SCIENCE
  • 2014. Constructing String Graphs in External Memory in ALGORITHMS IN BIOINFORMATICS
  • 2018-09-14. The Colored Longest Common Prefix Array Computed via Sequential Scans in STRING PROCESSING AND INFORMATION RETRIEVAL
  • 2018-09-14. Computing Burrows-Wheeler Similarity Distributions for String Collections in STRING PROCESSING AND INFORMATION RETRIEVAL
  • 2013. Space-Efficient Construction of the Burrows-Wheeler Transform in STRING PROCESSING AND INFORMATION RETRIEVAL
  • Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1186/s13015-019-0140-0

    DOI

    http://dx.doi.org/10.1186/s13015-019-0140-0

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1112645996

    PUBMED

    https://www.ncbi.nlm.nih.gov/pubmed/30899322


    Indexing Status Check whether this publication has been indexed by Scopus and Web Of Science using the SN Indexing Status Tool
    Incoming Citations Browse incoming citations for this publication using opencitations.net

    JSON-LD is the canonical representation for SciGraph data.

    TIP: You can open this SciGraph record using an external JSON-LD service: JSON-LD Playground Google SDTT

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0802", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Computation Theory and Mathematics", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Eastern Piedmont Amadeo Avogadro", 
              "id": "https://www.grid.ac/institutes/grid.16563.37", 
              "name": [
                "DiSIT, University of Eastern Piedmont, Viale Michel, 11, 15121, Alessandria, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Egidi", 
            "givenName": "Lavinia", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Sao Paulo", 
              "id": "https://www.grid.ac/institutes/grid.11899.38", 
              "name": [
                "Department of Computing and Mathematics, University of S\u00e3o Paulo, Av. Bandeirantes, 3900, 14040-901, Ribeir\u00e3o Preto, Brazil"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Louza", 
            "givenName": "Felipe A.", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Institute of Informatics and Telematics", 
              "id": "https://www.grid.ac/institutes/grid.473659.a", 
              "name": [
                "DiSIT, University of Eastern Piedmont, Viale Michel, 11, 15121, Alessandria, Italy", 
                "IIT CNR, Via Moruzzi, 1, 56124, Pisa, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Manzini", 
            "givenName": "Giovanni", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "State University of Campinas", 
              "id": "https://www.grid.ac/institutes/grid.411087.b", 
              "name": [
                "Institute of Computing, University of Campinas, Av. Albert Einstein, 1251, 13083-852, Campinas, Brazil"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Telles", 
            "givenName": "Guilherme P.", 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1002/spe.844", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1001233322"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2444016.2461327", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1012935447"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1216370.1216372", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1013726603"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-27810-8_32", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1018610105", 
              "https://doi.org/10.1007/978-3-540-27810-8_32"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-27810-8_32", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1018610105", 
              "https://doi.org/10.1007/978-3-540-27810-8_32"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-16321-0_36", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020531886", 
              "https://doi.org/10.1007/978-3-642-16321-0_36"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-16321-0_36", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020531886", 
              "https://doi.org/10.1007/978-3-642-16321-0_36"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-02432-5_5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1022178748", 
              "https://doi.org/10.1007/978-3-319-02432-5_5"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.tcs.2012.02.002", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1022562794"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.jda.2016.03.003", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1025682963"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-33122-0_18", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1027417410", 
              "https://doi.org/10.1007/978-3-642-33122-0_18"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2591796.2591885", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1029999221"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.ipl.2009.10.015", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1032471155"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-662-49529-2_13", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1033118390", 
              "https://doi.org/10.1007/978-3-662-49529-2_13"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s00453-016-0165-4", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1033175832", 
              "https://doi.org/10.1007/s00453-016-0165-4"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2493175.2493180", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1033671098"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2851491", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1035474229"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/384192.384193", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1040182656"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45749-6_61", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1042900552", 
              "https://doi.org/10.1007/3-540-45749-6_61"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-12200-2_60", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1043252920", 
              "https://doi.org/10.1007/978-3-642-12200-2_60"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-12200-2_60", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1043252920", 
              "https://doi.org/10.1007/978-3-642-12200-2_60"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.tcs.2007.07.014", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1043471141"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-662-44753-6_23", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1046567866", 
              "https://doi.org/10.1007/978-3-662-44753-6_23"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2649387.2649431", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1046847037"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1093/bioinformatics/btu584", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1049597276"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.jda.2016.04.002", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1049848652"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-44888-8_5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1051255923", 
              "https://doi.org/10.1007/3-540-44888-8_5"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/0020-0190(92)90176-v", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1052344476"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/tcbb.2011.127", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1061540869"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1137/0222058", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1062842461"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s11786-016-0281-1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1083757671", 
              "https://doi.org/10.1007/s11786-016-0281-1"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s11786-016-0281-1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1083757671", 
              "https://doi.org/10.1007/s11786-016-0281-1"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1093/bioinformatics/btx067", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1083840416"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.tcs.2017.03.039", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1084535347"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-67428-5_15", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1091476846", 
              "https://doi.org/10.1007/978-3-319-67428-5_15"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-67428-5_15", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1091476846", 
              "https://doi.org/10.1007/978-3-319-67428-5_15"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/dcc.2015.70", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1094119687"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1137/1.9781611974782.26", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1098555285"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1017/cbo9781139940023", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1098672764"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1017/cbo9780511574931", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1098674485"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1186/s13015-017-0117-9", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1099695383", 
              "https://doi.org/10.1186/s13015-017-0117-9"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1371/journal.pcbi.1005944", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1100667150"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-030-00479-8_13", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1106966951", 
              "https://doi.org/10.1007/978-3-030-00479-8_13"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-030-00479-8_13", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1106966951", 
              "https://doi.org/10.1007/978-3-030-00479-8_13"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-030-00479-8_23", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1106966962", 
              "https://doi.org/10.1007/978-3-030-00479-8_23"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-030-00479-8_23", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1106966962", 
              "https://doi.org/10.1007/978-3-030-00479-8_23"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2019-12", 
        "datePublishedReg": "2019-12-01", 
        "description": "Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.\nResults: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.\nConclusions: We prove that our algorithm performs O ( n  maxlcp )  sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.", 
        "genre": "research_article", 
        "id": "sg:pub.10.1186/s13015-019-0140-0", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": true, 
        "isFundedItemOf": [
          {
            "id": "sg:grant.6852716", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.4557966", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.4491249", 
            "type": "MonetaryGrant"
          }
        ], 
        "isPartOf": [
          {
            "id": "sg:journal.1036449", 
            "issn": [
              "1748-7188"
            ], 
            "name": "Algorithms for Molecular Biology", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "1", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "14"
          }
        ], 
        "name": "External memory BWT and LCP computation for sequence collections with applications", 
        "pagination": "6", 
        "productId": [
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "b95092b3a0e74be7932a9bf8040bb719d524c54f2045912ec7dc489b4807b872"
            ]
          }, 
          {
            "name": "pubmed_id", 
            "type": "PropertyValue", 
            "value": [
              "30899322"
            ]
          }, 
          {
            "name": "nlm_unique_id", 
            "type": "PropertyValue", 
            "value": [
              "101265088"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1186/s13015-019-0140-0"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1112645996"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1186/s13015-019-0140-0", 
          "https://app.dimensions.ai/details/publication/pub.1112645996"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2019-04-11T13:17", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000368_0000000368/records_78934_00000001.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://link.springer.com/10.1186%2Fs13015-019-0140-0"
      }
    ]
     

    Download the RDF metadata as:  json-ld nt turtle xml License info

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s13015-019-0140-0'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s13015-019-0140-0'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s13015-019-0140-0'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s13015-019-0140-0'


     

    This table displays all metadata directly associated to this object as RDF triples.

    233 TRIPLES      21 PREDICATES      68 URIs      21 LITERALS      9 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1186/s13015-019-0140-0 schema:about anzsrc-for:08
    2 anzsrc-for:0802
    3 schema:author Nc85691c21b86404d949c7281f343543c
    4 schema:citation sg:pub.10.1007/3-540-44888-8_5
    5 sg:pub.10.1007/3-540-45749-6_61
    6 sg:pub.10.1007/978-3-030-00479-8_13
    7 sg:pub.10.1007/978-3-030-00479-8_23
    8 sg:pub.10.1007/978-3-319-02432-5_5
    9 sg:pub.10.1007/978-3-319-67428-5_15
    10 sg:pub.10.1007/978-3-540-27810-8_32
    11 sg:pub.10.1007/978-3-642-12200-2_60
    12 sg:pub.10.1007/978-3-642-16321-0_36
    13 sg:pub.10.1007/978-3-642-33122-0_18
    14 sg:pub.10.1007/978-3-662-44753-6_23
    15 sg:pub.10.1007/978-3-662-49529-2_13
    16 sg:pub.10.1007/s00453-016-0165-4
    17 sg:pub.10.1007/s11786-016-0281-1
    18 sg:pub.10.1186/s13015-017-0117-9
    19 https://doi.org/10.1002/spe.844
    20 https://doi.org/10.1016/0020-0190(92)90176-v
    21 https://doi.org/10.1016/j.ipl.2009.10.015
    22 https://doi.org/10.1016/j.jda.2016.03.003
    23 https://doi.org/10.1016/j.jda.2016.04.002
    24 https://doi.org/10.1016/j.tcs.2007.07.014
    25 https://doi.org/10.1016/j.tcs.2012.02.002
    26 https://doi.org/10.1016/j.tcs.2017.03.039
    27 https://doi.org/10.1017/cbo9780511574931
    28 https://doi.org/10.1017/cbo9781139940023
    29 https://doi.org/10.1093/bioinformatics/btu584
    30 https://doi.org/10.1093/bioinformatics/btx067
    31 https://doi.org/10.1109/dcc.2015.70
    32 https://doi.org/10.1109/tcbb.2011.127
    33 https://doi.org/10.1137/0222058
    34 https://doi.org/10.1137/1.9781611974782.26
    35 https://doi.org/10.1145/1216370.1216372
    36 https://doi.org/10.1145/2444016.2461327
    37 https://doi.org/10.1145/2493175.2493180
    38 https://doi.org/10.1145/2591796.2591885
    39 https://doi.org/10.1145/2649387.2649431
    40 https://doi.org/10.1145/2851491
    41 https://doi.org/10.1145/384192.384193
    42 https://doi.org/10.1371/journal.pcbi.1005944
    43 schema:datePublished 2019-12
    44 schema:datePublishedReg 2019-12-01
    45 schema:description Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs O ( n maxlcp ) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
    46 schema:genre research_article
    47 schema:inLanguage en
    48 schema:isAccessibleForFree true
    49 schema:isPartOf N573b3c426b894d9d8d6175cb7f40c985
    50 Nc4237f58694340059349f6dd7e094cec
    51 sg:journal.1036449
    52 schema:name External memory BWT and LCP computation for sequence collections with applications
    53 schema:pagination 6
    54 schema:productId N3d8bc51002624e31a13d3675ffd5c86a
    55 Na9515d55e6bd4a778770b9827b8fd3ec
    56 Ncbbc16bed15645cbbb749586b4d71f8a
    57 Nec94a4f0e45d4ec98efa5a20470ccf36
    58 Nfc6253eac13845129c549aa5ca5f5710
    59 schema:sameAs https://app.dimensions.ai/details/publication/pub.1112645996
    60 https://doi.org/10.1186/s13015-019-0140-0
    61 schema:sdDatePublished 2019-04-11T13:17
    62 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    63 schema:sdPublisher Nfe5124f4ece64663bb0e5f84d72e0dff
    64 schema:url https://link.springer.com/10.1186%2Fs13015-019-0140-0
    65 sgo:license sg:explorer/license/
    66 sgo:sdDataset articles
    67 rdf:type schema:ScholarlyArticle
    68 N0e069ccca257421e857feac0914bb9eb schema:affiliation https://www.grid.ac/institutes/grid.11899.38
    69 schema:familyName Louza
    70 schema:givenName Felipe A.
    71 rdf:type schema:Person
    72 N3d8bc51002624e31a13d3675ffd5c86a schema:name pubmed_id
    73 schema:value 30899322
    74 rdf:type schema:PropertyValue
    75 N573b3c426b894d9d8d6175cb7f40c985 schema:issueNumber 1
    76 rdf:type schema:PublicationIssue
    77 N682a4e92a1384e04afdabe122dd1eaf8 schema:affiliation https://www.grid.ac/institutes/grid.411087.b
    78 schema:familyName Telles
    79 schema:givenName Guilherme P.
    80 rdf:type schema:Person
    81 N6e1a821c251f4099a406f0ae8d71446c schema:affiliation https://www.grid.ac/institutes/grid.16563.37
    82 schema:familyName Egidi
    83 schema:givenName Lavinia
    84 rdf:type schema:Person
    85 N95d626fd0dac45f58a92dd5476d30642 schema:affiliation https://www.grid.ac/institutes/grid.473659.a
    86 schema:familyName Manzini
    87 schema:givenName Giovanni
    88 rdf:type schema:Person
    89 Na9515d55e6bd4a778770b9827b8fd3ec schema:name nlm_unique_id
    90 schema:value 101265088
    91 rdf:type schema:PropertyValue
    92 Nab2f8f848e5b484d86d02429a25aadae rdf:first N682a4e92a1384e04afdabe122dd1eaf8
    93 rdf:rest rdf:nil
    94 Nb593e6970b1346418bba65d1f25feb66 rdf:first N95d626fd0dac45f58a92dd5476d30642
    95 rdf:rest Nab2f8f848e5b484d86d02429a25aadae
    96 Nc4237f58694340059349f6dd7e094cec schema:volumeNumber 14
    97 rdf:type schema:PublicationVolume
    98 Nc85691c21b86404d949c7281f343543c rdf:first N6e1a821c251f4099a406f0ae8d71446c
    99 rdf:rest Nf2c4019efb524a84be596de0634de15f
    100 Ncbbc16bed15645cbbb749586b4d71f8a schema:name dimensions_id
    101 schema:value pub.1112645996
    102 rdf:type schema:PropertyValue
    103 Nec94a4f0e45d4ec98efa5a20470ccf36 schema:name doi
    104 schema:value 10.1186/s13015-019-0140-0
    105 rdf:type schema:PropertyValue
    106 Nf2c4019efb524a84be596de0634de15f rdf:first N0e069ccca257421e857feac0914bb9eb
    107 rdf:rest Nb593e6970b1346418bba65d1f25feb66
    108 Nfc6253eac13845129c549aa5ca5f5710 schema:name readcube_id
    109 schema:value b95092b3a0e74be7932a9bf8040bb719d524c54f2045912ec7dc489b4807b872
    110 rdf:type schema:PropertyValue
    111 Nfe5124f4ece64663bb0e5f84d72e0dff schema:name Springer Nature - SN SciGraph project
    112 rdf:type schema:Organization
    113 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    114 schema:name Information and Computing Sciences
    115 rdf:type schema:DefinedTerm
    116 anzsrc-for:0802 schema:inDefinedTermSet anzsrc-for:
    117 schema:name Computation Theory and Mathematics
    118 rdf:type schema:DefinedTerm
    119 sg:grant.4491249 http://pending.schema.org/fundedItem sg:pub.10.1186/s13015-019-0140-0
    120 rdf:type schema:MonetaryGrant
    121 sg:grant.4557966 http://pending.schema.org/fundedItem sg:pub.10.1186/s13015-019-0140-0
    122 rdf:type schema:MonetaryGrant
    123 sg:grant.6852716 http://pending.schema.org/fundedItem sg:pub.10.1186/s13015-019-0140-0
    124 rdf:type schema:MonetaryGrant
    125 sg:journal.1036449 schema:issn 1748-7188
    126 schema:name Algorithms for Molecular Biology
    127 rdf:type schema:Periodical
    128 sg:pub.10.1007/3-540-44888-8_5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1051255923
    129 https://doi.org/10.1007/3-540-44888-8_5
    130 rdf:type schema:CreativeWork
    131 sg:pub.10.1007/3-540-45749-6_61 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042900552
    132 https://doi.org/10.1007/3-540-45749-6_61
    133 rdf:type schema:CreativeWork
    134 sg:pub.10.1007/978-3-030-00479-8_13 schema:sameAs https://app.dimensions.ai/details/publication/pub.1106966951
    135 https://doi.org/10.1007/978-3-030-00479-8_13
    136 rdf:type schema:CreativeWork
    137 sg:pub.10.1007/978-3-030-00479-8_23 schema:sameAs https://app.dimensions.ai/details/publication/pub.1106966962
    138 https://doi.org/10.1007/978-3-030-00479-8_23
    139 rdf:type schema:CreativeWork
    140 sg:pub.10.1007/978-3-319-02432-5_5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022178748
    141 https://doi.org/10.1007/978-3-319-02432-5_5
    142 rdf:type schema:CreativeWork
    143 sg:pub.10.1007/978-3-319-67428-5_15 schema:sameAs https://app.dimensions.ai/details/publication/pub.1091476846
    144 https://doi.org/10.1007/978-3-319-67428-5_15
    145 rdf:type schema:CreativeWork
    146 sg:pub.10.1007/978-3-540-27810-8_32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1018610105
    147 https://doi.org/10.1007/978-3-540-27810-8_32
    148 rdf:type schema:CreativeWork
    149 sg:pub.10.1007/978-3-642-12200-2_60 schema:sameAs https://app.dimensions.ai/details/publication/pub.1043252920
    150 https://doi.org/10.1007/978-3-642-12200-2_60
    151 rdf:type schema:CreativeWork
    152 sg:pub.10.1007/978-3-642-16321-0_36 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020531886
    153 https://doi.org/10.1007/978-3-642-16321-0_36
    154 rdf:type schema:CreativeWork
    155 sg:pub.10.1007/978-3-642-33122-0_18 schema:sameAs https://app.dimensions.ai/details/publication/pub.1027417410
    156 https://doi.org/10.1007/978-3-642-33122-0_18
    157 rdf:type schema:CreativeWork
    158 sg:pub.10.1007/978-3-662-44753-6_23 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046567866
    159 https://doi.org/10.1007/978-3-662-44753-6_23
    160 rdf:type schema:CreativeWork
    161 sg:pub.10.1007/978-3-662-49529-2_13 schema:sameAs https://app.dimensions.ai/details/publication/pub.1033118390
    162 https://doi.org/10.1007/978-3-662-49529-2_13
    163 rdf:type schema:CreativeWork
    164 sg:pub.10.1007/s00453-016-0165-4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1033175832
    165 https://doi.org/10.1007/s00453-016-0165-4
    166 rdf:type schema:CreativeWork
    167 sg:pub.10.1007/s11786-016-0281-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1083757671
    168 https://doi.org/10.1007/s11786-016-0281-1
    169 rdf:type schema:CreativeWork
    170 sg:pub.10.1186/s13015-017-0117-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1099695383
    171 https://doi.org/10.1186/s13015-017-0117-9
    172 rdf:type schema:CreativeWork
    173 https://doi.org/10.1002/spe.844 schema:sameAs https://app.dimensions.ai/details/publication/pub.1001233322
    174 rdf:type schema:CreativeWork
    175 https://doi.org/10.1016/0020-0190(92)90176-v schema:sameAs https://app.dimensions.ai/details/publication/pub.1052344476
    176 rdf:type schema:CreativeWork
    177 https://doi.org/10.1016/j.ipl.2009.10.015 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032471155
    178 rdf:type schema:CreativeWork
    179 https://doi.org/10.1016/j.jda.2016.03.003 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025682963
    180 rdf:type schema:CreativeWork
    181 https://doi.org/10.1016/j.jda.2016.04.002 schema:sameAs https://app.dimensions.ai/details/publication/pub.1049848652
    182 rdf:type schema:CreativeWork
    183 https://doi.org/10.1016/j.tcs.2007.07.014 schema:sameAs https://app.dimensions.ai/details/publication/pub.1043471141
    184 rdf:type schema:CreativeWork
    185 https://doi.org/10.1016/j.tcs.2012.02.002 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022562794
    186 rdf:type schema:CreativeWork
    187 https://doi.org/10.1016/j.tcs.2017.03.039 schema:sameAs https://app.dimensions.ai/details/publication/pub.1084535347
    188 rdf:type schema:CreativeWork
    189 https://doi.org/10.1017/cbo9780511574931 schema:sameAs https://app.dimensions.ai/details/publication/pub.1098674485
    190 rdf:type schema:CreativeWork
    191 https://doi.org/10.1017/cbo9781139940023 schema:sameAs https://app.dimensions.ai/details/publication/pub.1098672764
    192 rdf:type schema:CreativeWork
    193 https://doi.org/10.1093/bioinformatics/btu584 schema:sameAs https://app.dimensions.ai/details/publication/pub.1049597276
    194 rdf:type schema:CreativeWork
    195 https://doi.org/10.1093/bioinformatics/btx067 schema:sameAs https://app.dimensions.ai/details/publication/pub.1083840416
    196 rdf:type schema:CreativeWork
    197 https://doi.org/10.1109/dcc.2015.70 schema:sameAs https://app.dimensions.ai/details/publication/pub.1094119687
    198 rdf:type schema:CreativeWork
    199 https://doi.org/10.1109/tcbb.2011.127 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061540869
    200 rdf:type schema:CreativeWork
    201 https://doi.org/10.1137/0222058 schema:sameAs https://app.dimensions.ai/details/publication/pub.1062842461
    202 rdf:type schema:CreativeWork
    203 https://doi.org/10.1137/1.9781611974782.26 schema:sameAs https://app.dimensions.ai/details/publication/pub.1098555285
    204 rdf:type schema:CreativeWork
    205 https://doi.org/10.1145/1216370.1216372 schema:sameAs https://app.dimensions.ai/details/publication/pub.1013726603
    206 rdf:type schema:CreativeWork
    207 https://doi.org/10.1145/2444016.2461327 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012935447
    208 rdf:type schema:CreativeWork
    209 https://doi.org/10.1145/2493175.2493180 schema:sameAs https://app.dimensions.ai/details/publication/pub.1033671098
    210 rdf:type schema:CreativeWork
    211 https://doi.org/10.1145/2591796.2591885 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029999221
    212 rdf:type schema:CreativeWork
    213 https://doi.org/10.1145/2649387.2649431 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046847037
    214 rdf:type schema:CreativeWork
    215 https://doi.org/10.1145/2851491 schema:sameAs https://app.dimensions.ai/details/publication/pub.1035474229
    216 rdf:type schema:CreativeWork
    217 https://doi.org/10.1145/384192.384193 schema:sameAs https://app.dimensions.ai/details/publication/pub.1040182656
    218 rdf:type schema:CreativeWork
    219 https://doi.org/10.1371/journal.pcbi.1005944 schema:sameAs https://app.dimensions.ai/details/publication/pub.1100667150
    220 rdf:type schema:CreativeWork
    221 https://www.grid.ac/institutes/grid.11899.38 schema:alternateName University of Sao Paulo
    222 schema:name Department of Computing and Mathematics, University of São Paulo, Av. Bandeirantes, 3900, 14040-901, Ribeirão Preto, Brazil
    223 rdf:type schema:Organization
    224 https://www.grid.ac/institutes/grid.16563.37 schema:alternateName University of Eastern Piedmont Amadeo Avogadro
    225 schema:name DiSIT, University of Eastern Piedmont, Viale Michel, 11, 15121, Alessandria, Italy
    226 rdf:type schema:Organization
    227 https://www.grid.ac/institutes/grid.411087.b schema:alternateName State University of Campinas
    228 schema:name Institute of Computing, University of Campinas, Av. Albert Einstein, 1251, 13083-852, Campinas, Brazil
    229 rdf:type schema:Organization
    230 https://www.grid.ac/institutes/grid.473659.a schema:alternateName Institute of Informatics and Telematics
    231 schema:name DiSIT, University of Eastern Piedmont, Viale Michel, 11, 15121, Alessandria, Italy
    232 IIT CNR, Via Moruzzi, 1, 56124, Pisa, Italy
    233 rdf:type schema:Organization
     




    Preview window. Press ESC to close (or click here)


    ...