Cleaning Web Pages for Effective Web Content Mining


Ontology type: schema:Chapter     


Chapter Info

DATE

2006

AUTHORS

Jing Li, C. I. Ezeife

ABSTRACT

Classifying and mining noise-free web pages will improve the accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages is content irrelevant to the main content of the pages being mined, and includes advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching contents but are weak at detecting near-duplicate blocks, typified by items like navigation bars. This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages in order to improve the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Then, relevant web page blocks are identified as those with a high importance level by analyzing physical features of the blocks such as the block location, the percentage of web links in the block, and the level of similarity of the block's contents to other blocks. Important blocks are exported for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more accurate and efficient web page classification than comparable existing approaches.

PAGES

560-571

References to SciGraph publications

  • 2003-04-15. Extracting Content Structure for Web Pages Based on Visual Representation in WEB TECHNOLOGIES AND APPLICATIONS
Book

    TITLE

    Database and Expert Systems Applications

    ISBN

    978-3-540-37871-6
    978-3-540-37872-3

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/11827405_55

    DOI

    http://dx.doi.org/10.1007/11827405_55

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1000529130



    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Windsor", 
              "id": "https://www.grid.ac/institutes/grid.267455.7", 
              "name": [
                "School of Computer Science, University of Windsor, N9B 3P4, Windsor, Ontario, Canada"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Li", 
            "givenName": "Jing", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Windsor", 
              "id": "https://www.grid.ac/institutes/grid.267455.7", 
              "name": [
                "School of Computer Science, University of Windsor, N9B 3P4, Windsor, Ontario, Canada"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Ezeife", 
            "givenName": "C. I.", 
            "id": "sg:person.01200460536.41", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01200460536.41"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1145/1008992.1009035", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009660816"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/956750.956785", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1016384532"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-36901-5_42", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1022869484", 
              "https://doi.org/10.1007/3-540-36901-5_42"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.4018/jdwm.2005040101", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1027173160"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/223784.223807", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030246694"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/988672.988700", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1032277700"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/511446.511522", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037067381"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/775047.775134", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1039512325"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2006", 
        "datePublishedReg": "2006-01-01", 
        "description": "Classifying and mining noise-free web pages will improve on accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages are irrelevant to the main content on the web pages being mined, and include advertisements, navigation bar, and copyright notices. The few existing work on web page cleaning detect noise blocks with exact matching contents but are weak at detecting near duplicate blocks, characterized by items like navigation bars. This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages for purposes of improving the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Then, relevant web page blocks are identified as those with high importance level by analyzing such physical features of the blocks as the block location, percentage of web links on the block, and level of similarity of block contents to other blocks. Important blocks are exported to be used for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to a more accurate and efficient web page classification results than comparable existing approaches.", 
        "editor": [
          {
            "familyName": "Bressan", 
            "givenName": "St\u00e9phane", 
            "type": "Person"
          }, 
          {
            "familyName": "K\u00fcng", 
            "givenName": "Josef", 
            "type": "Person"
          }, 
          {
            "familyName": "Wagner", 
            "givenName": "Roland", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/11827405_55", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": {
          "isbn": [
            "978-3-540-37871-6", 
            "978-3-540-37872-3"
          ], 
          "name": "Database and Expert Systems Applications", 
          "type": "Book"
        }, 
        "name": "Cleaning Web Pages for Effective Web Content Mining", 
        "pagination": "560-571", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1000529130"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/11827405_55"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "02849759bdeb1e59769a82bf5048223d9a9edddc3aa80ed457c95f5cf7cf9c99"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/11827405_55", 
          "https://app.dimensions.ai/details/publication/pub.1000529130"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T07:30", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000356_0000000356/records_57883_00000000.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F11827405_55"
      }
    ]
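A record in this shape can be read with any JSON parser. The following is a minimal sketch, assuming a hand-trimmed excerpt that keeps only a few fields of the record; the helper names (`author_names`, `unique_citation_ids`, `product_id`) are hypothetical and not part of any SciGraph tooling. A repeated citation id is included in the excerpt on purpose, to show order-preserving de-duplication.

```python
import json

# Hand-trimmed excerpt in the shape of the record (assumption: only the
# fields used below are kept; one citation id is repeated deliberately).
RECORD_JSON = """
[
  {
    "id": "sg:pub.10.1007/11827405_55",
    "name": "Cleaning Web Pages for Effective Web Content Mining",
    "author": [
      {"familyName": "Li", "givenName": "Jing", "type": "Person"},
      {"familyName": "Ezeife", "givenName": "C. I.", "type": "Person"}
    ],
    "citation": [
      {"id": "https://doi.org/10.1145/1008992.1009035"},
      {"id": "sg:pub.10.1007/3-540-36901-5_42"},
      {"id": "sg:pub.10.1007/3-540-36901-5_42"},
      {"id": "https://doi.org/10.4018/jdwm.2005040101"}
    ],
    "productId": [
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/11827405_55"]}
    ]
  }
]
"""


def author_names(record):
    """Return 'givenName familyName' strings in the order listed."""
    return [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]


def unique_citation_ids(record):
    """Return citation ids with repeats dropped, keeping first-seen order."""
    ids = (c["id"] for c in record.get("citation", []))
    return list(dict.fromkeys(ids))


def product_id(record, name):
    """Return the first value of the productId entry called `name`, or None."""
    for entry in record.get("productId", []):
        if entry.get("name") == name and entry.get("value"):
            return entry["value"][0]
    return None


# The top-level JSON value is a list holding a single record object.
record = json.loads(RECORD_JSON)[0]

if __name__ == "__main__":
    print(author_names(record))
    print(unique_citation_ids(record))
    print(product_id(record, "doi"))
```

Using `dict.fromkeys` keeps the first occurrence of each id and preserves the listed order, which a plain `set` would not guarantee.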
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/11827405_55'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/11827405_55'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/11827405_55'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/11827405_55'
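The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is an illustrative sketch, not an official client: the short format keys (`json-ld`, `nt`, `turtle`, `xml`) and the function names are assumptions chosen here, and `fetch` assumes the endpoint still answers as described and returns UTF-8 text.

```python
import urllib.request

# URL taken from the curl commands above.
BASE_URL = "https://scigraph.springernature.com/pub.10.1007/11827405_55"

# Hypothetical short keys mapped to the media types used in the curl examples.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE_URL):
    """Build a GET request asking for the serialization named by `fmt`."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt, url=BASE_URL, timeout=10.0):
    """Perform the request and return the body as text (assumes UTF-8)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Network call; only runs when the script is executed directly.
    print(fetch("turtle")[:500])
```

Keeping request construction separate from the network call makes the header logic easy to check without contacting the server.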


     

    This table displays all metadata directly associated to this object as RDF triples.

    106 TRIPLES      23 PREDICATES      35 URIs      20 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/11827405_55 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author Na20b4cdac50e494ba3f729e3c8020204
    4 schema:citation sg:pub.10.1007/3-540-36901-5_42
    5 https://doi.org/10.1145/1008992.1009035
    6 https://doi.org/10.1145/223784.223807
    7 https://doi.org/10.1145/511446.511522
    8 https://doi.org/10.1145/775047.775134
    9 https://doi.org/10.1145/956750.956785
    10 https://doi.org/10.1145/988672.988700
    11 https://doi.org/10.4018/jdwm.2005040101
    12 schema:datePublished 2006
    13 schema:datePublishedReg 2006-01-01
    14 schema:description Classifying and mining noise-free web pages will improve on accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages are irrelevant to the main content on the web pages being mined, and include advertisements, navigation bar, and copyright notices. The few existing work on web page cleaning detect noise blocks with exact matching contents but are weak at detecting near duplicate blocks, characterized by items like navigation bars. This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages for purposes of improving the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Then, relevant web page blocks are identified as those with high importance level by analyzing such physical features of the blocks as the block location, percentage of web links on the block, and level of similarity of block contents to other blocks. Important blocks are exported to be used for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to a more accurate and efficient web page classification results than comparable existing approaches.
    15 schema:editor N12d938fe73cc4b1c9860cb67e4b08e19
    16 schema:genre chapter
    17 schema:inLanguage en
    18 schema:isAccessibleForFree false
    19 schema:isPartOf N62e04c50bc9349b0bb1e7e022232af36
    20 schema:name Cleaning Web Pages for Effective Web Content Mining
    21 schema:pagination 560-571
    22 schema:productId N66a36a87db3a47a4b76e1ea2b69b30f0
    23 N7a5bd18ef2404c729aced4e17c064045
    24 Nb6d91fcbecb54c7aa8d494724f401b5f
    25 schema:publisher Nfc5a3d954340445e809d3a9a65573d1b
    26 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000529130
    27 https://doi.org/10.1007/11827405_55
    28 schema:sdDatePublished 2019-04-16T07:30
    29 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    30 schema:sdPublisher Nb2cbca9ccaaa424db6aef0fcaaed9396
    31 schema:url https://link.springer.com/10.1007%2F11827405_55
    32 sgo:license sg:explorer/license/
    33 sgo:sdDataset chapters
    34 rdf:type schema:Chapter
    35 N12d938fe73cc4b1c9860cb67e4b08e19 rdf:first Neabad8297c8b4e9a9cbbdec8f415ad53
    36 rdf:rest Ne5d6e0d35e37425f8a5268fe20b5026b
    37 N1d34b66e2f6a466c96a17f53cb4a0124 schema:affiliation https://www.grid.ac/institutes/grid.267455.7
    38 schema:familyName Li
    39 schema:givenName Jing
    40 rdf:type schema:Person
    41 N4079ca5803dc494bb1f11081291cb2ea schema:familyName Küng
    42 schema:givenName Josef
    43 rdf:type schema:Person
    44 N55ba98e18d2042839109f3c7dc2c4d3c schema:familyName Wagner
    45 schema:givenName Roland
    46 rdf:type schema:Person
    47 N62e04c50bc9349b0bb1e7e022232af36 schema:isbn 978-3-540-37871-6
    48 978-3-540-37872-3
    49 schema:name Database and Expert Systems Applications
    50 rdf:type schema:Book
    51 N66a36a87db3a47a4b76e1ea2b69b30f0 schema:name readcube_id
    52 schema:value 02849759bdeb1e59769a82bf5048223d9a9edddc3aa80ed457c95f5cf7cf9c99
    53 rdf:type schema:PropertyValue
    54 N7a5bd18ef2404c729aced4e17c064045 schema:name doi
    55 schema:value 10.1007/11827405_55
    56 rdf:type schema:PropertyValue
    57 Na06edac395fb4b54b845b90a14351998 rdf:first sg:person.01200460536.41
    58 rdf:rest rdf:nil
    59 Na20b4cdac50e494ba3f729e3c8020204 rdf:first N1d34b66e2f6a466c96a17f53cb4a0124
    60 rdf:rest Na06edac395fb4b54b845b90a14351998
    61 Nb2cbca9ccaaa424db6aef0fcaaed9396 schema:name Springer Nature - SN SciGraph project
    62 rdf:type schema:Organization
    63 Nb6d91fcbecb54c7aa8d494724f401b5f schema:name dimensions_id
    64 schema:value pub.1000529130
    65 rdf:type schema:PropertyValue
    66 Nd2373dd462784842b2fa65b7b811bd83 rdf:first N55ba98e18d2042839109f3c7dc2c4d3c
    67 rdf:rest rdf:nil
    68 Ne5d6e0d35e37425f8a5268fe20b5026b rdf:first N4079ca5803dc494bb1f11081291cb2ea
    69 rdf:rest Nd2373dd462784842b2fa65b7b811bd83
    70 Neabad8297c8b4e9a9cbbdec8f415ad53 schema:familyName Bressan
    71 schema:givenName Stéphane
    72 rdf:type schema:Person
    73 Nfc5a3d954340445e809d3a9a65573d1b schema:location Berlin, Heidelberg
    74 schema:name Springer Berlin Heidelberg
    75 rdf:type schema:Organisation
    76 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    77 schema:name Information and Computing Sciences
    78 rdf:type schema:DefinedTerm
    79 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    80 schema:name Artificial Intelligence and Image Processing
    81 rdf:type schema:DefinedTerm
    82 sg:person.01200460536.41 schema:affiliation https://www.grid.ac/institutes/grid.267455.7
    83 schema:familyName Ezeife
    84 schema:givenName C. I.
    85 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01200460536.41
    86 rdf:type schema:Person
    87 sg:pub.10.1007/3-540-36901-5_42 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022869484
    88 https://doi.org/10.1007/3-540-36901-5_42
    89 rdf:type schema:CreativeWork
    90 https://doi.org/10.1145/1008992.1009035 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009660816
    91 rdf:type schema:CreativeWork
    92 https://doi.org/10.1145/223784.223807 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030246694
    93 rdf:type schema:CreativeWork
    94 https://doi.org/10.1145/511446.511522 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037067381
    95 rdf:type schema:CreativeWork
    96 https://doi.org/10.1145/775047.775134 schema:sameAs https://app.dimensions.ai/details/publication/pub.1039512325
    97 rdf:type schema:CreativeWork
    98 https://doi.org/10.1145/956750.956785 schema:sameAs https://app.dimensions.ai/details/publication/pub.1016384532
    99 rdf:type schema:CreativeWork
    100 https://doi.org/10.1145/988672.988700 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032277700
    101 rdf:type schema:CreativeWork
    102 https://doi.org/10.4018/jdwm.2005040101 schema:sameAs https://app.dimensions.ai/details/publication/pub.1027173160
    103 rdf:type schema:CreativeWork
    104 https://www.grid.ac/institutes/grid.267455.7 schema:alternateName University of Windsor
    105 schema:name School of Computer Science, University of Windsor, N9B 3P4, Windsor, Ontario, Canada
    106 rdf:type schema:Organization
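In the table, the editors (rows 35-36 and 66-69) and the authors (rows 57-60) are stored as RDF collections: each list node points to its item with `rdf:first` and to the next node with `rdf:rest`, and the chain ends at `rdf:nil`. The sketch below copies the editor rows into plain tuples and walks the chain to recover the editors in order. It is a minimal illustration under the assumption that the triples are already in memory; the function name `walk_rdf_list` is hypothetical, and a real application would more likely use an RDF library.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# Editor-related rows copied from the triple table (subject, predicate, object).
TRIPLES = [
    ("N12d938fe73cc4b1c9860cb67e4b08e19", RDF_FIRST, "Neabad8297c8b4e9a9cbbdec8f415ad53"),
    ("N12d938fe73cc4b1c9860cb67e4b08e19", RDF_REST, "Ne5d6e0d35e37425f8a5268fe20b5026b"),
    ("Ne5d6e0d35e37425f8a5268fe20b5026b", RDF_FIRST, "N4079ca5803dc494bb1f11081291cb2ea"),
    ("Ne5d6e0d35e37425f8a5268fe20b5026b", RDF_REST, "Nd2373dd462784842b2fa65b7b811bd83"),
    ("Nd2373dd462784842b2fa65b7b811bd83", RDF_FIRST, "N55ba98e18d2042839109f3c7dc2c4d3c"),
    ("Nd2373dd462784842b2fa65b7b811bd83", RDF_REST, RDF_NIL),
    ("Neabad8297c8b4e9a9cbbdec8f415ad53", "schema:familyName", "Bressan"),
    ("N4079ca5803dc494bb1f11081291cb2ea", "schema:familyName", "Küng"),
    ("N55ba98e18d2042839109f3c7dc2c4d3c", "schema:familyName", "Wagner"),
]


def walk_rdf_list(head, triples):
    """Follow rdf:first / rdf:rest from `head` and return the items in order."""
    index = {(s, p): o for s, p, o in triples}
    items, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen:
            # Guard against malformed data that loops back on itself.
            raise ValueError(f"cycle detected at list node {node}")
        seen.add(node)
        items.append(index[(node, RDF_FIRST)])
        node = index[(node, RDF_REST)]
    return items


# Head of the editor list: the object of schema:editor in row 15.
EDITOR_LIST_HEAD = "N12d938fe73cc4b1c9860cb67e4b08e19"

editor_nodes = walk_rdf_list(EDITOR_LIST_HEAD, TRIPLES)
family_names = {s: o for s, p, o in TRIPLES if p == "schema:familyName"}
editor_names = [family_names[n] for n in editor_nodes]

if __name__ == "__main__":
    print(editor_names)
```

The same walk applies to the author list, starting from the object of `schema:author` in row 3.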
     



