Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2018-06-07

AUTHORS

Chichang Jou

ABSTRACT

Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.

PAGES

1-12

References to SciGraph publications

  • 2005. Constructing Interface Schemas for Search Interfaces of Web Databases in WEB INFORMATION SYSTEMS ENGINEERING – WISE 2005
  • 2007-06. Towards Deeper Understanding of the Search Interfaces of the Deep Web in WORLD WIDE WEB
  • 2013-07. Active XML-based Web data integration in INFORMATION SYSTEMS FRONTIERS
  • 2009. Modeling and Extracting Deep-Web Query Interfaces in ADVANCES IN INFORMATION AND INTELLIGENT SYSTEMS
  • 2013-10. The ontological key: automatically understanding and integrating forms to access the deep Web in THE VLDB JOURNAL

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6

    DOI

    http://dx.doi.org/10.1007/s10796-018-9863-6

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1104452799



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information Systems", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Tamkang University", 
              "id": "https://www.grid.ac/institutes/grid.264580.d", 
              "name": [
                "Department of Information Management, Tamkang University, 151 Ying-zhuan Road, 25137, Tamsui, Taiwan, People\u2019s Republic of China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Jou", 
            "givenName": "Chichang", 
            "id": "sg:person.015125212575.92", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015125212575.92"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/s00778-013-0323-0", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1008357845", 
              "https://doi.org/10.1007/s00778-013-0323-0"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/11581062_3", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1014402576", 
              "https://doi.org/10.1007/11581062_3"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s11280-006-0010-9", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1029875833", 
              "https://doi.org/10.1007/s11280-006-0010-9"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1007568.1007583", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1029920189"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s10796-012-9405-6", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1044619145", 
              "https://doi.org/10.1007/s10796-012-9405-6"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2460383.2460387", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1045606540"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/2955129.2955170", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1047781612"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-04141-9_4", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1050226944", 
              "https://doi.org/10.1007/978-3-642-04141-9_4"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1645953.1645959", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1050344907"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.14778/1453856.1453931", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1067367367"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.14778/1687627.1687665", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1067367569"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/iske.2015.94", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1094202485"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/cist.2016.7805022", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1094648652"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1166/asl.2018.10714", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1101491273"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2018-06-07", 
        "datePublishedReg": "2018-06-07", 
        "description": "Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user\u2019s view and the designer\u2019s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.", 
        "genre": "research_article", 
        "id": "sg:pub.10.1007/s10796-018-9863-6", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": [
          {
            "id": "sg:journal.1136609", 
            "issn": [
              "1387-3326", 
              "1572-9419"
            ], 
            "name": "Information Systems Frontiers", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "1", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "21"
          }
        ], 
        "name": "Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules", 
        "pagination": "1-12", 
        "productId": [
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s10796-018-9863-6"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "29abb2be5361c0c89fb3d77f9f518944f454d157baf4a95d0406349603471b2c"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1104452799"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s10796-018-9863-6", 
          "https://app.dimensions.ai/details/publication/pub.1104452799"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2019-04-15T08:51", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000374_0000000374/records_119737_00000001.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://link.springer.com/10.1007%2Fs10796-018-9863-6"
      }
    ]
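
As a minimal sketch of how a record shaped like the one above might be consumed, the following Python reads a JSON-LD array, pulls out the title and DOI, and lists each citation id once, deduplicating defensively in case a listing repeats an entry. The function name `summarize_record` and the trimmed `SAMPLE` string are illustrative assumptions, not part of SciGraph; only field names that appear in the record are used.

```python
import json

# Trimmed, hypothetical sample that mirrors the field names in the record above.
SAMPLE = """
[
  {
    "name": "Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules",
    "productId": [
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/s10796-018-9863-6"]},
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1104452799"]}
    ],
    "citation": [
      {"id": "sg:pub.10.1007/11581062_3", "type": "CreativeWork"},
      {"id": "sg:pub.10.1007/11581062_3", "type": "CreativeWork"},
      {"id": "https://doi.org/10.1145/1007568.1007583", "type": "CreativeWork"}
    ]
  }
]
"""


def summarize_record(text):
    """Return title, DOI, and unique citation ids from a JSON-LD array."""
    record = json.loads(text)[0]  # the page wraps a single record in a list
    # productId is a list of name/value pairs; pick the one named "doi".
    doi = next(
        (p["value"][0] for p in record.get("productId", []) if p.get("name") == "doi"),
        None,
    )
    # dict.fromkeys keeps first-seen order while dropping repeated ids.
    citations = list(dict.fromkeys(c["id"] for c in record.get("citation", [])))
    return {"title": record.get("name"), "doi": doi, "citations": citations}


summary = summarize_record(SAMPLE)
print(summary["doi"], len(summary["citations"]))
```

Plain `json` is enough here because the sketch only reads literal keys; resolving the `@context` into full IRIs would need a dedicated JSON-LD processor.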
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'
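
The same content negotiation can be sketched in Python with the standard library. The helper name `build_request` and the `FORMATS` table are illustrative assumptions; the URL and `Accept` values are copied from the curl commands above. The sketch only constructs the requests and does not send them, since this page does not confirm that the endpoint still responds.

```python
from urllib.request import Request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6"

# Media types taken from the curl commands above; the short keys are hypothetical.
FORMATS = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build a GET request asking for one RDF serialization via the Accept header."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return Request(url, headers={"Accept": FORMATS[fmt]})


req = build_request("turtle")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

To actually fetch a serialization, the request object can be passed to `urllib.request.urlopen`, ideally with a timeout and error handling around it.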


     

    This table displays all metadata directly associated to this object as RDF triples.

    108 TRIPLES      21 PREDICATES      40 URIs      18 LITERALS      7 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s10796-018-9863-6 schema:about anzsrc-for:08
    2 anzsrc-for:0806
    3 schema:author N83bdfcb58eca451d86c468ae46d47563
    4 schema:citation sg:pub.10.1007/11581062_3
    5 sg:pub.10.1007/978-3-642-04141-9_4
    6 sg:pub.10.1007/s00778-013-0323-0
    7 sg:pub.10.1007/s10796-012-9405-6
    8 sg:pub.10.1007/s11280-006-0010-9
    9 https://doi.org/10.1109/cist.2016.7805022
    10 https://doi.org/10.1109/iske.2015.94
    11 https://doi.org/10.1145/1007568.1007583
    12 https://doi.org/10.1145/1645953.1645959
    13 https://doi.org/10.1145/2460383.2460387
    14 https://doi.org/10.1145/2955129.2955170
    15 https://doi.org/10.1166/asl.2018.10714
    16 https://doi.org/10.14778/1453856.1453931
    17 https://doi.org/10.14778/1687627.1687665
    18 schema:datePublished 2018-06-07
    19 schema:datePublishedReg 2018-06-07
    20 schema:description Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.
    21 schema:genre research_article
    22 schema:inLanguage en
    23 schema:isAccessibleForFree false
    24 schema:isPartOf N16b88e1bef084a729fd5c9b508f0836e
    25 Nfbc8b742ed87462e927a84a6b1f40c9f
    26 sg:journal.1136609
    27 schema:name Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules
    28 schema:pagination 1-12
    29 schema:productId N1ecc4b15d2f24c6d8dfd356b39193049
    30 N2cb03a865cba48cebe9022830b7618fc
    31 Nf20e02efd59b4784a93b7a8ce2621fdb
    32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1104452799
    33 https://doi.org/10.1007/s10796-018-9863-6
    34 schema:sdDatePublished 2019-04-15T08:51
    35 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    36 schema:sdPublisher Nea4cb56f6b3943699d73783844581435
    37 schema:url https://link.springer.com/10.1007%2Fs10796-018-9863-6
    38 sgo:license sg:explorer/license/
    39 sgo:sdDataset articles
    40 rdf:type schema:ScholarlyArticle
    41 N16b88e1bef084a729fd5c9b508f0836e schema:volumeNumber 21
    42 rdf:type schema:PublicationVolume
    43 N1ecc4b15d2f24c6d8dfd356b39193049 schema:name readcube_id
    44 schema:value 29abb2be5361c0c89fb3d77f9f518944f454d157baf4a95d0406349603471b2c
    45 rdf:type schema:PropertyValue
    46 N2cb03a865cba48cebe9022830b7618fc schema:name dimensions_id
    47 schema:value pub.1104452799
    48 rdf:type schema:PropertyValue
    49 N83bdfcb58eca451d86c468ae46d47563 rdf:first sg:person.015125212575.92
    50 rdf:rest rdf:nil
    51 Nea4cb56f6b3943699d73783844581435 schema:name Springer Nature - SN SciGraph project
    52 rdf:type schema:Organization
    53 Nf20e02efd59b4784a93b7a8ce2621fdb schema:name doi
    54 schema:value 10.1007/s10796-018-9863-6
    55 rdf:type schema:PropertyValue
    56 Nfbc8b742ed87462e927a84a6b1f40c9f schema:issueNumber 1
    57 rdf:type schema:PublicationIssue
    58 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    59 schema:name Information and Computing Sciences
    60 rdf:type schema:DefinedTerm
    61 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
    62 schema:name Information Systems
    63 rdf:type schema:DefinedTerm
    64 sg:journal.1136609 schema:issn 1387-3326
    65 1572-9419
    66 schema:name Information Systems Frontiers
    67 rdf:type schema:Periodical
    68 sg:person.015125212575.92 schema:affiliation https://www.grid.ac/institutes/grid.264580.d
    69 schema:familyName Jou
    70 schema:givenName Chichang
    71 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015125212575.92
    72 rdf:type schema:Person
    73 sg:pub.10.1007/11581062_3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014402576
    74 https://doi.org/10.1007/11581062_3
    75 rdf:type schema:CreativeWork
    76 sg:pub.10.1007/978-3-642-04141-9_4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050226944
    77 https://doi.org/10.1007/978-3-642-04141-9_4
    78 rdf:type schema:CreativeWork
    79 sg:pub.10.1007/s00778-013-0323-0 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008357845
    80 https://doi.org/10.1007/s00778-013-0323-0
    81 rdf:type schema:CreativeWork
    82 sg:pub.10.1007/s10796-012-9405-6 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044619145
    83 https://doi.org/10.1007/s10796-012-9405-6
    84 rdf:type schema:CreativeWork
    85 sg:pub.10.1007/s11280-006-0010-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029875833
    86 https://doi.org/10.1007/s11280-006-0010-9
    87 rdf:type schema:CreativeWork
    88 https://doi.org/10.1109/cist.2016.7805022 schema:sameAs https://app.dimensions.ai/details/publication/pub.1094648652
    89 rdf:type schema:CreativeWork
    90 https://doi.org/10.1109/iske.2015.94 schema:sameAs https://app.dimensions.ai/details/publication/pub.1094202485
    91 rdf:type schema:CreativeWork
    92 https://doi.org/10.1145/1007568.1007583 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029920189
    93 rdf:type schema:CreativeWork
    94 https://doi.org/10.1145/1645953.1645959 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050344907
    95 rdf:type schema:CreativeWork
    96 https://doi.org/10.1145/2460383.2460387 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045606540
    97 rdf:type schema:CreativeWork
    98 https://doi.org/10.1145/2955129.2955170 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047781612
    99 rdf:type schema:CreativeWork
    100 https://doi.org/10.1166/asl.2018.10714 schema:sameAs https://app.dimensions.ai/details/publication/pub.1101491273
    101 rdf:type schema:CreativeWork
    102 https://doi.org/10.14778/1453856.1453931 schema:sameAs https://app.dimensions.ai/details/publication/pub.1067367367
    103 rdf:type schema:CreativeWork
    104 https://doi.org/10.14778/1687627.1687665 schema:sameAs https://app.dimensions.ai/details/publication/pub.1067367569
    105 rdf:type schema:CreativeWork
    106 https://www.grid.ac/institutes/grid.264580.d schema:alternateName Tamkang University
    107 schema:name Department of Information Management, Tamkang University, 151 Ying-zhuan Road, 25137, Tamsui, Taiwan, People’s Republic of China
    108 rdf:type schema:Organization
     



