Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2018-06-07

AUTHORS

Chichang Jou

ABSTRACT

Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.

PAGES

1-12

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6

DOI

http://dx.doi.org/10.1007/s10796-018-9863-6

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1104452799



JSON-LD is the canonical representation for SciGraph data.

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Tamkang University", 
          "id": "https://www.grid.ac/institutes/grid.264580.d", 
          "name": [
            "Department of Information Management, Tamkang University, 151 Ying-zhuan Road, 25137, Tamsui, Taiwan, People\u2019s Republic of China"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Jou", 
        "givenName": "Chichang", 
        "id": "sg:person.015125212575.92", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015125212575.92"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/s00778-013-0323-0", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008357845", 
          "https://doi.org/10.1007/s00778-013-0323-0"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/11581062_3", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1014402576", 
          "https://doi.org/10.1007/11581062_3"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/s11280-006-0010-9", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1029875833", 
          "https://doi.org/10.1007/s11280-006-0010-9"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1007568.1007583", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1029920189"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/s10796-012-9405-6", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1044619145", 
          "https://doi.org/10.1007/s10796-012-9405-6"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/2460383.2460387", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1045606540"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/2955129.2955170", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1047781612"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/978-3-642-04141-9_4", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050226944", 
          "https://doi.org/10.1007/978-3-642-04141-9_4"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1645953.1645959", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050344907"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.14778/1453856.1453931", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1067367367"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.14778/1687627.1687665", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1067367569"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/iske.2015.94", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1094202485"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/cist.2016.7805022", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1094648652"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1166/asl.2018.10714", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1101491273"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2018-06-07", 
    "datePublishedReg": "2018-06-07", 
    "description": "Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user\u2019s view and the designer\u2019s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1007/s10796-018-9863-6", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": false, 
    "isPartOf": [
      {
        "id": "sg:journal.1136609", 
        "issn": [
          "1387-3326", 
          "1572-9419"
        ], 
        "name": "Information Systems Frontiers", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "21"
      }
    ], 
    "name": "Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules", 
    "pagination": "1-12", 
    "productId": [
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/s10796-018-9863-6"
        ]
      }, 
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "29abb2be5361c0c89fb3d77f9f518944f454d157baf4a95d0406349603471b2c"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1104452799"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1007/s10796-018-9863-6", 
      "https://app.dimensions.ai/details/publication/pub.1104452799"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-15T08:51", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000374_0000000374/records_119737_00000001.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://link.springer.com/10.1007%2Fs10796-018-9863-6"
  }
]
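
As a minimal sketch of how a record of this shape might be consumed, the Python below lists the DOIs of the cited works. It assumes the layout shown above, where a citation's DOI URL appears either as its "id" or inside its "sameAs" list. The embedded sample is a trimmed copy of the record, and the helper name `cited_dois` is hypothetical, not part of any SciGraph tooling.

```python
import json

# Trimmed copy of the record above; only the fields used here are kept.
# The second citation is repeated on purpose to show de-duplication.
RECORD_JSON = """
[
  {
    "id": "sg:pub.10.1007/s10796-018-9863-6",
    "name": "Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules",
    "citation": [
      {"id": "sg:pub.10.1007/s00778-013-0323-0",
       "sameAs": ["https://app.dimensions.ai/details/publication/pub.1008357845",
                  "https://doi.org/10.1007/s00778-013-0323-0"]},
      {"id": "sg:pub.10.1007/11581062_3",
       "sameAs": ["https://app.dimensions.ai/details/publication/pub.1014402576",
                  "https://doi.org/10.1007/11581062_3"]},
      {"id": "sg:pub.10.1007/11581062_3",
       "sameAs": ["https://app.dimensions.ai/details/publication/pub.1014402576",
                  "https://doi.org/10.1007/11581062_3"]},
      {"id": "https://doi.org/10.1145/1007568.1007583",
       "sameAs": ["https://app.dimensions.ai/details/publication/pub.1029920189"]}
    ]
  }
]
"""

DOI_PREFIX = "https://doi.org/"


def cited_dois(record):
    """Return the DOIs cited by one record, in first-seen order, without repeats."""
    seen = []
    for entry in record.get("citation", []):
        # The DOI URL may be the entry's own id or one of its sameAs links.
        candidates = [entry.get("id", "")] + entry.get("sameAs", [])
        for url in candidates:
            if url.startswith(DOI_PREFIX):
                doi = url[len(DOI_PREFIX):]
                if doi not in seen:
                    seen.append(doi)
    return seen


record = json.loads(RECORD_JSON)[0]
dois = cited_dois(record)
for doi in dois:
    print(doi)
```

Because JSON-LD is plain JSON, the standard `json` module is enough for this kind of field lookup; a full JSON-LD processor would only be needed to expand the "@context" into absolute IRIs.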
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6'
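
The four curl commands differ only in their Accept header, so the same content negotiation can be sketched in Python with the standard library. This is an illustration under assumptions: the short format keys and the `build_request` and `fetch` helpers are hypothetical names, and whether the endpoint still answers as described above is not verified here.

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/s10796-018-9863-6"

# Media types copied from the curl commands above; the short keys are
# illustrative labels chosen for this sketch.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build (but do not send) a GET request asking for one RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body decoded as UTF-8 text."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch("turtle")` would correspond to the Turtle curl command; building the request separately keeps the header logic inspectable without touching the network.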


 

This table displays all metadata directly associated with this object as RDF triples.

108 TRIPLES      21 PREDICATES      40 URIs      18 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/s10796-018-9863-6 schema:about anzsrc-for:08
2 anzsrc-for:0806
3 schema:author N28ce90ceb80944ac920d9b41d0ade18b
4 schema:citation sg:pub.10.1007/11581062_3
5 sg:pub.10.1007/978-3-642-04141-9_4
6 sg:pub.10.1007/s00778-013-0323-0
7 sg:pub.10.1007/s10796-012-9405-6
8 sg:pub.10.1007/s11280-006-0010-9
9 https://doi.org/10.1109/cist.2016.7805022
10 https://doi.org/10.1109/iske.2015.94
11 https://doi.org/10.1145/1007568.1007583
12 https://doi.org/10.1145/1645953.1645959
13 https://doi.org/10.1145/2460383.2460387
14 https://doi.org/10.1145/2955129.2955170
15 https://doi.org/10.1166/asl.2018.10714
16 https://doi.org/10.14778/1453856.1453931
17 https://doi.org/10.14778/1687627.1687665
18 schema:datePublished 2018-06-07
19 schema:datePublishedReg 2018-06-07
20 schema:description Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.
21 schema:genre research_article
22 schema:inLanguage en
23 schema:isAccessibleForFree false
24 schema:isPartOf N183bf382298943f0b8fc5ffb62f127a4
25 N2371af613cce46dc8d68f94fe9140355
26 sg:journal.1136609
27 schema:name Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules
28 schema:pagination 1-12
29 schema:productId N2880db80e095443d9ae4d490f9a53213
30 N5f9028679e0b43c1b171278e0223e4e0
31 N63e81e1733f14691b1888972d71270e7
32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1104452799
33 https://doi.org/10.1007/s10796-018-9863-6
34 schema:sdDatePublished 2019-04-15T08:51
35 schema:sdLicense https://scigraph.springernature.com/explorer/license/
36 schema:sdPublisher N11aa45ae295c4e608044e499a3f7dccb
37 schema:url https://link.springer.com/10.1007%2Fs10796-018-9863-6
38 sgo:license sg:explorer/license/
39 sgo:sdDataset articles
40 rdf:type schema:ScholarlyArticle
41 N11aa45ae295c4e608044e499a3f7dccb schema:name Springer Nature - SN SciGraph project
42 rdf:type schema:Organization
43 N183bf382298943f0b8fc5ffb62f127a4 schema:volumeNumber 21
44 rdf:type schema:PublicationVolume
45 N2371af613cce46dc8d68f94fe9140355 schema:issueNumber 1
46 rdf:type schema:PublicationIssue
47 N2880db80e095443d9ae4d490f9a53213 schema:name dimensions_id
48 schema:value pub.1104452799
49 rdf:type schema:PropertyValue
50 N28ce90ceb80944ac920d9b41d0ade18b rdf:first sg:person.015125212575.92
51 rdf:rest rdf:nil
52 N5f9028679e0b43c1b171278e0223e4e0 schema:name doi
53 schema:value 10.1007/s10796-018-9863-6
54 rdf:type schema:PropertyValue
55 N63e81e1733f14691b1888972d71270e7 schema:name readcube_id
56 schema:value 29abb2be5361c0c89fb3d77f9f518944f454d157baf4a95d0406349603471b2c
57 rdf:type schema:PropertyValue
58 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
59 schema:name Information and Computing Sciences
60 rdf:type schema:DefinedTerm
61 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
62 schema:name Information Systems
63 rdf:type schema:DefinedTerm
64 sg:journal.1136609 schema:issn 1387-3326
65 1572-9419
66 schema:name Information Systems Frontiers
67 rdf:type schema:Periodical
68 sg:person.015125212575.92 schema:affiliation https://www.grid.ac/institutes/grid.264580.d
69 schema:familyName Jou
70 schema:givenName Chichang
71 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015125212575.92
72 rdf:type schema:Person
73 sg:pub.10.1007/11581062_3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014402576
74 https://doi.org/10.1007/11581062_3
75 rdf:type schema:CreativeWork
76 sg:pub.10.1007/978-3-642-04141-9_4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050226944
77 https://doi.org/10.1007/978-3-642-04141-9_4
78 rdf:type schema:CreativeWork
79 sg:pub.10.1007/s00778-013-0323-0 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008357845
80 https://doi.org/10.1007/s00778-013-0323-0
81 rdf:type schema:CreativeWork
82 sg:pub.10.1007/s10796-012-9405-6 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044619145
83 https://doi.org/10.1007/s10796-012-9405-6
84 rdf:type schema:CreativeWork
85 sg:pub.10.1007/s11280-006-0010-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029875833
86 https://doi.org/10.1007/s11280-006-0010-9
87 rdf:type schema:CreativeWork
88 https://doi.org/10.1109/cist.2016.7805022 schema:sameAs https://app.dimensions.ai/details/publication/pub.1094648652
89 rdf:type schema:CreativeWork
90 https://doi.org/10.1109/iske.2015.94 schema:sameAs https://app.dimensions.ai/details/publication/pub.1094202485
91 rdf:type schema:CreativeWork
92 https://doi.org/10.1145/1007568.1007583 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029920189
93 rdf:type schema:CreativeWork
94 https://doi.org/10.1145/1645953.1645959 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050344907
95 rdf:type schema:CreativeWork
96 https://doi.org/10.1145/2460383.2460387 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045606540
97 rdf:type schema:CreativeWork
98 https://doi.org/10.1145/2955129.2955170 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047781612
99 rdf:type schema:CreativeWork
100 https://doi.org/10.1166/asl.2018.10714 schema:sameAs https://app.dimensions.ai/details/publication/pub.1101491273
101 rdf:type schema:CreativeWork
102 https://doi.org/10.14778/1453856.1453931 schema:sameAs https://app.dimensions.ai/details/publication/pub.1067367367
103 rdf:type schema:CreativeWork
104 https://doi.org/10.14778/1687627.1687665 schema:sameAs https://app.dimensions.ai/details/publication/pub.1067367569
105 rdf:type schema:CreativeWork
106 https://www.grid.ac/institutes/grid.264580.d schema:alternateName Tamkang University
107 schema:name Department of Information Management, Tamkang University, 151 Ying-zhuan Road, 25137, Tamsui, Taiwan, People’s Republic of China
108 rdf:type schema:Organization
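
The rows above appear to omit the subject, or the subject and predicate, when they repeat the row before. Under that assumption (it is a reading of the listing, not something the page states), the sketch below expands abbreviated rows back into full triples. The sample rows are rows 43 to 46 and 64 to 67 of the table, with omitted cells written as `None`; `fill_down` is a hypothetical helper name.

```python
def fill_down(rows):
    """Expand rows whose subject/predicate cells are None into full triples.

    Assumption: an omitted cell inherits the value from the nearest row above.
    """
    triples = []
    subject = predicate = None
    for s, p, o in rows:
        subject = s if s is not None else subject
        predicate = p if p is not None else predicate
        triples.append((subject, predicate, o))
    return triples


# Rows 43-46 and 64-67 of the table, with blank cells made explicit.
rows = [
    ("N183bf382298943f0b8fc5ffb62f127a4", "schema:volumeNumber", "21"),
    (None, "rdf:type", "schema:PublicationVolume"),
    ("N2371af613cce46dc8d68f94fe9140355", "schema:issueNumber", "1"),
    (None, "rdf:type", "schema:PublicationIssue"),
    ("sg:journal.1136609", "schema:issn", "1387-3326"),
    (None, None, "1572-9419"),
    (None, "schema:name", "Information Systems Frontiers"),
    (None, "rdf:type", "schema:Periodical"),
]

triples = fill_down(rows)
predicates = sorted({p for _, p, _ in triples})

for triple in triples:
    print(triple)
print(len(triples), "triples,", len(predicates), "distinct predicates")
```

Counting distinct predicates over the expanded list is one plausible way a summary line such as the one above the table could be derived, though the page does not say how its counts are computed.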
 



