Boosting Pattern Matching Performance via k-bit Filtering View Full Text


Ontology type: schema:Chapter     


Chapter Info

DATE

2010-08-18

AUTHORS

M. Oğuzhan Külekci , Jeffrey Scott Vitter , Bojian Xu

ABSTRACT

This study explores an alternative way of storing text files in a difierent format that will speed up the searching process. The input file is decomposed into two parts as filter and payload. Filter part is composed of most informative k-bits of each byte from the original file. Remaining bits form the payload. Selection of the most informative bits are achieved according to their entropy. When an input pattern is to be searched on the new file structure, same decomposition is performed on the pattern. The filter part of the pattern is queried in the filter part of the file following by a verification process of the payload for the matching positions. Experiments conducted on natural language texts, plain ascii DNA sequences, and random byte sequences showed that the search performance with the proposed scheme is on the average two times faster than the tested exact pattern matching algorithms. More... »

PAGES

27-32

References to SciGraph publications

  • 2007. Fast Matching Method for DNA Sequences in COMBINATORICS, ALGORITHMS, PROBABILISTIC AND EXPERIMENTAL METHODOLOGIES
  • 2002-04-19. Factor Oracle: A New Structure for Pattern Matching in SOFSEM’99: THEORY AND PRACTICE OF INFORMATICS
  • 2007. Accelerating Boyer Moore Searches on Binary Texts in IMPLEMENTATION AND APPLICATION OF AUTOMATA
  • Book

    TITLE

    Computer and Information Sciences

    ISBN

    978-90-481-9793-4
    978-90-481-9794-1

    Author Affiliations

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-90-481-9794-1_6

    DOI

    http://dx.doi.org/10.1007/978-90-481-9794-1_6

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1042893294


    Indexing Status Check whether this publication has been indexed by Scopus and Web Of Science using the SN Indexing Status Tool
    Incoming Citations Browse incoming citations for this publication using opencitations.net

    JSON-LD is the canonical representation for SciGraph data.

    TIP: You can open this SciGraph record using an external JSON-LD service: JSON-LD Playground Google SDTT

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Texas A&M University", 
              "id": "https://www.grid.ac/institutes/grid.264756.4", 
              "name": [
                "Department of Computer Science & Engineering, Texas A&M University, College Station, TX, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "K\u00fclekci", 
            "givenName": "M. O\u011fuzhan", 
            "id": "sg:person.011026615332.33", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011026615332.33"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Texas A&M University", 
              "id": "https://www.grid.ac/institutes/grid.264756.4", 
              "name": [
                "Department of Computer Science & Engineering, Texas A&M University, College Station, TX, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Vitter", 
            "givenName": "Jeffrey Scott", 
            "id": "sg:person.0613677314.28", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0613677314.28"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Texas A&M University", 
              "id": "https://www.grid.ac/institutes/grid.264756.4", 
              "name": [
                "Department of Computer Science & Engineering, Texas A&M University, College Station, TX, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Xu", 
            "givenName": "Bojian", 
            "id": "sg:person.01205767644.44", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01205767644.44"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/978-3-540-76336-9_14", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1001522965", 
              "https://doi.org/10.1007/978-3-540-76336-9_14"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-76336-9_14", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1001522965", 
              "https://doi.org/10.1007/978-3-540-76336-9_14"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-47849-3_18", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1003836697", 
              "https://doi.org/10.1007/3-540-47849-3_18"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-47849-3_18", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1003836697", 
              "https://doi.org/10.1007/3-540-47849-3_18"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.ipl.2007.01.002", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037097960"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/79173.79184", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037364690"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-74450-4_25", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1045271369", 
              "https://doi.org/10.1007/978-3-540-74450-4_25"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-74450-4_25", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1045271369", 
              "https://doi.org/10.1007/978-3-540-74450-4_25"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/359842.359859", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1051315984"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/j.1538-7305.1948.tb01338.x", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1052867467"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1137/s0097539702402354", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1062879354"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2010-08-18", 
        "datePublishedReg": "2010-08-18", 
        "description": "This study explores an alternative way of storing text files in a difierent format that will speed up the searching process. The input file is decomposed into two parts as filter and payload. Filter part is composed of most informative k-bits of each byte from the original file. Remaining bits form the payload. Selection of the most informative bits are achieved according to their entropy. When an input pattern is to be searched on the new file structure, same decomposition is performed on the pattern. The filter part of the pattern is queried in the filter part of the file following by a verification process of the payload for the matching positions. Experiments conducted on natural language texts, plain ascii DNA sequences, and random byte sequences showed that the search performance with the proposed scheme is on the average two times faster than the tested exact pattern matching algorithms.", 
        "editor": [
          {
            "familyName": "Gelenbe", 
            "givenName": "Erol", 
            "type": "Person"
          }, 
          {
            "familyName": "Lent", 
            "givenName": "Ricardo", 
            "type": "Person"
          }, 
          {
            "familyName": "Sakellari", 
            "givenName": "Georgia", 
            "type": "Person"
          }, 
          {
            "familyName": "Sacan", 
            "givenName": "Ahmet", 
            "type": "Person"
          }, 
          {
            "familyName": "Toroslu", 
            "givenName": "Hakki", 
            "type": "Person"
          }, 
          {
            "familyName": "Yazici", 
            "givenName": "Adnan", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-90-481-9794-1_6", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": {
          "isbn": [
            "978-90-481-9793-4", 
            "978-90-481-9794-1"
          ], 
          "name": "Computer and Information Sciences", 
          "type": "Book"
        }, 
        "name": "Boosting Pattern Matching Performance via k-bit Filtering", 
        "pagination": "27-32", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1042893294"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-90-481-9794-1_6"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "cc5141dfaee04ed13107ccdd966b9bd061229ab4ccf4cbeb9f595e32315b1442"
            ]
          }
        ], 
        "publisher": {
          "location": "Dordrecht", 
          "name": "Springer Netherlands", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-90-481-9794-1_6", 
          "https://app.dimensions.ai/details/publication/pub.1042893294"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T08:22", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000363_0000000363/records_70032_00000001.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F978-90-481-9794-1_6"
      }
    ]
     

    Download the RDF metadata as:  json-ld nt turtle xml License info

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-90-481-9794-1_6'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-90-481-9794-1_6'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-90-481-9794-1_6'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-90-481-9794-1_6'


     

    This table displays all metadata directly associated to this object as RDF triples.

    131 TRIPLES      23 PREDICATES      34 URIs      19 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/978-90-481-9794-1_6 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N7591089cb8c54929b315fc8acfae8d11
    4 schema:citation sg:pub.10.1007/3-540-47849-3_18
    5 sg:pub.10.1007/978-3-540-74450-4_25
    6 sg:pub.10.1007/978-3-540-76336-9_14
    7 https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
    8 https://doi.org/10.1016/j.ipl.2007.01.002
    9 https://doi.org/10.1137/s0097539702402354
    10 https://doi.org/10.1145/359842.359859
    11 https://doi.org/10.1145/79173.79184
    12 schema:datePublished 2010-08-18
    13 schema:datePublishedReg 2010-08-18
    14 schema:description This study explores an alternative way of storing text files in a difierent format that will speed up the searching process. The input file is decomposed into two parts as filter and payload. Filter part is composed of most informative k-bits of each byte from the original file. Remaining bits form the payload. Selection of the most informative bits are achieved according to their entropy. When an input pattern is to be searched on the new file structure, same decomposition is performed on the pattern. The filter part of the pattern is queried in the filter part of the file following by a verification process of the payload for the matching positions. Experiments conducted on natural language texts, plain ascii DNA sequences, and random byte sequences showed that the search performance with the proposed scheme is on the average two times faster than the tested exact pattern matching algorithms.
    15 schema:editor N90a758c26dfa4e60a4d038515f35bb5d
    16 schema:genre chapter
    17 schema:inLanguage en
    18 schema:isAccessibleForFree false
    19 schema:isPartOf Nbf21c48bcd33477e8ae5462b0f83a1d0
    20 schema:name Boosting Pattern Matching Performance via k-bit Filtering
    21 schema:pagination 27-32
    22 schema:productId N196f958b26e143dcb5ac2361a85f236e
    23 N4d59111192064fa4bc449c6141e59cf5
    24 N57e11d5b61524084861ebdf5dda44b1f
    25 schema:publisher N88b591ed8e174c77a014e76a9d78bb12
    26 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042893294
    27 https://doi.org/10.1007/978-90-481-9794-1_6
    28 schema:sdDatePublished 2019-04-16T08:22
    29 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    30 schema:sdPublisher Ne7c8c60560bf4e14a7e1a23687f5d09a
    31 schema:url https://link.springer.com/10.1007%2F978-90-481-9794-1_6
    32 sgo:license sg:explorer/license/
    33 sgo:sdDataset chapters
    34 rdf:type schema:Chapter
    35 N196f958b26e143dcb5ac2361a85f236e schema:name dimensions_id
    36 schema:value pub.1042893294
    37 rdf:type schema:PropertyValue
    38 N275f2d895cb843698fa54871ea7a575e rdf:first sg:person.01205767644.44
    39 rdf:rest rdf:nil
    40 N4c964ecaddfa412eb3010db471fb7834 rdf:first Nc14e3325fcd740118049a48a1ced98b0
    41 rdf:rest N7b1cf7ab6ae6461396bf57a67f33ab8a
    42 N4d59111192064fa4bc449c6141e59cf5 schema:name doi
    43 schema:value 10.1007/978-90-481-9794-1_6
    44 rdf:type schema:PropertyValue
    45 N542d3d99a0484b50ad4abd83fd0683f5 rdf:first sg:person.0613677314.28
    46 rdf:rest N275f2d895cb843698fa54871ea7a575e
    47 N57e11d5b61524084861ebdf5dda44b1f schema:name readcube_id
    48 schema:value cc5141dfaee04ed13107ccdd966b9bd061229ab4ccf4cbeb9f595e32315b1442
    49 rdf:type schema:PropertyValue
    50 N5b1168ba44ac4680a6877be33237d874 schema:familyName Lent
    51 schema:givenName Ricardo
    52 rdf:type schema:Person
    53 N758c63c6855a4215829d8fb332ed589c schema:familyName Sacan
    54 schema:givenName Ahmet
    55 rdf:type schema:Person
    56 N7591089cb8c54929b315fc8acfae8d11 rdf:first sg:person.011026615332.33
    57 rdf:rest N542d3d99a0484b50ad4abd83fd0683f5
    58 N7b1cf7ab6ae6461396bf57a67f33ab8a rdf:first N758c63c6855a4215829d8fb332ed589c
    59 rdf:rest Naa0d2ac9ecd1412588ceee8f03ba54de
    60 N88b591ed8e174c77a014e76a9d78bb12 schema:location Dordrecht
    61 schema:name Springer Netherlands
    62 rdf:type schema:Organisation
    63 N90a758c26dfa4e60a4d038515f35bb5d rdf:first Na78c01e7aad246b3a9f1d0353cb7a661
    64 rdf:rest Nf6044759c8814afbbab2ed54c91cec50
    65 N9e0457449241464b991c8ae6cd8cb61d schema:familyName Toroslu
    66 schema:givenName Hakki
    67 rdf:type schema:Person
    68 Na78c01e7aad246b3a9f1d0353cb7a661 schema:familyName Gelenbe
    69 schema:givenName Erol
    70 rdf:type schema:Person
    71 Naa0d2ac9ecd1412588ceee8f03ba54de rdf:first N9e0457449241464b991c8ae6cd8cb61d
    72 rdf:rest Nb88c70cd063445ce8eeb059353acc814
    73 Nb88c70cd063445ce8eeb059353acc814 rdf:first Nf32758abc63647769b57d6a5f2300fa8
    74 rdf:rest rdf:nil
    75 Nbf21c48bcd33477e8ae5462b0f83a1d0 schema:isbn 978-90-481-9793-4
    76 978-90-481-9794-1
    77 schema:name Computer and Information Sciences
    78 rdf:type schema:Book
    79 Nc14e3325fcd740118049a48a1ced98b0 schema:familyName Sakellari
    80 schema:givenName Georgia
    81 rdf:type schema:Person
    82 Ne7c8c60560bf4e14a7e1a23687f5d09a schema:name Springer Nature - SN SciGraph project
    83 rdf:type schema:Organization
    84 Nf32758abc63647769b57d6a5f2300fa8 schema:familyName Yazici
    85 schema:givenName Adnan
    86 rdf:type schema:Person
    87 Nf6044759c8814afbbab2ed54c91cec50 rdf:first N5b1168ba44ac4680a6877be33237d874
    88 rdf:rest N4c964ecaddfa412eb3010db471fb7834
    89 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    90 schema:name Information and Computing Sciences
    91 rdf:type schema:DefinedTerm
    92 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    93 schema:name Artificial Intelligence and Image Processing
    94 rdf:type schema:DefinedTerm
    95 sg:person.011026615332.33 schema:affiliation https://www.grid.ac/institutes/grid.264756.4
    96 schema:familyName Külekci
    97 schema:givenName M. Oğuzhan
    98 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011026615332.33
    99 rdf:type schema:Person
    100 sg:person.01205767644.44 schema:affiliation https://www.grid.ac/institutes/grid.264756.4
    101 schema:familyName Xu
    102 schema:givenName Bojian
    103 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01205767644.44
    104 rdf:type schema:Person
    105 sg:person.0613677314.28 schema:affiliation https://www.grid.ac/institutes/grid.264756.4
    106 schema:familyName Vitter
    107 schema:givenName Jeffrey Scott
    108 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0613677314.28
    109 rdf:type schema:Person
    110 sg:pub.10.1007/3-540-47849-3_18 schema:sameAs https://app.dimensions.ai/details/publication/pub.1003836697
    111 https://doi.org/10.1007/3-540-47849-3_18
    112 rdf:type schema:CreativeWork
    113 sg:pub.10.1007/978-3-540-74450-4_25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045271369
    114 https://doi.org/10.1007/978-3-540-74450-4_25
    115 rdf:type schema:CreativeWork
    116 sg:pub.10.1007/978-3-540-76336-9_14 schema:sameAs https://app.dimensions.ai/details/publication/pub.1001522965
    117 https://doi.org/10.1007/978-3-540-76336-9_14
    118 rdf:type schema:CreativeWork
    119 https://doi.org/10.1002/j.1538-7305.1948.tb01338.x schema:sameAs https://app.dimensions.ai/details/publication/pub.1052867467
    120 rdf:type schema:CreativeWork
    121 https://doi.org/10.1016/j.ipl.2007.01.002 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037097960
    122 rdf:type schema:CreativeWork
    123 https://doi.org/10.1137/s0097539702402354 schema:sameAs https://app.dimensions.ai/details/publication/pub.1062879354
    124 rdf:type schema:CreativeWork
    125 https://doi.org/10.1145/359842.359859 schema:sameAs https://app.dimensions.ai/details/publication/pub.1051315984
    126 rdf:type schema:CreativeWork
    127 https://doi.org/10.1145/79173.79184 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037364690
    128 rdf:type schema:CreativeWork
    129 https://www.grid.ac/institutes/grid.264756.4 schema:alternateName Texas A&M University
    130 schema:name Department of Computer Science & Engineering, Texas A&M University, College Station, TX, USA
    131 rdf:type schema:Organization
     




    Preview window. Press ESC to close (or click here)


    ...