Using Restrictive Classification and Meta Classification for Junk Elimination


Ontology type: schema:Chapter      Open Access: True


Chapter Info

DATE

2005

AUTHORS

Stefan Siersdorfer, Gerhard Weikum

ABSTRACT

This paper addresses the problem of performing supervised classification on document collections that also contain junk documents. By “junk documents” we mean documents that do not belong to the topic categories (classes) we are interested in. This type of document typically cannot be covered by the training set; nevertheless, in many real-world applications (e.g. classification of web or intranet content, focused crawling, etc.) such documents occur quite often and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.
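The idea of leaving a document out instead of forcing it into a class can be illustrated with a small sketch. This is not the authors' implementation: the toy keyword classifiers, the confidence measure, the threshold value, and the unanimous-vote rule are all assumptions chosen only to show what "restrictive" and "ensemble-based meta" decisions could look like.

```python
from typing import Callable, Dict, Optional, Sequence, Set, Tuple

# Assumption: a base classifier maps a document to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def keyword_classifier(vocab: Dict[str, Set[str]]) -> Classifier:
    """Toy classifier: confidence is the share of tokens matching the best class."""
    def clf(doc: str) -> Tuple[str, float]:
        tokens = doc.lower().split()
        scores = {label: sum(t in words for t in tokens) for label, words in vocab.items()}
        best = max(scores, key=scores.get)
        conf = scores[best] / len(tokens) if tokens else 0.0
        return best, conf
    return clf


def restrictive(clf: Classifier, threshold: float) -> Callable[[str], Optional[str]]:
    """Return the label only if confidence reaches the threshold, else abstain (None)."""
    def decide(doc: str) -> Optional[str]:
        label, conf = clf(doc)
        return label if conf >= threshold else None
    return decide


def unanimous_meta(classifiers: Sequence[Classifier], threshold: float) -> Callable[[str], Optional[str]]:
    """Assign a label only if every restrictive base classifier agrees on it."""
    deciders = [restrictive(c, threshold) for c in classifiers]

    def decide(doc: str) -> Optional[str]:
        votes = {d(doc) for d in deciders}
        if len(votes) == 1:
            return next(iter(votes))  # may itself be None if all abstained
        return None  # disagreement: treat the document as junk
    return decide


# Two hypothetical base classifiers with slightly different vocabularies.
clf_a = keyword_classifier({"sports": {"football", "goal"}, "politics": {"election", "vote"}})
clf_b = keyword_classifier({"sports": {"football", "match"}, "politics": {"election", "parliament"}})
meta = unanimous_meta([clf_a, clf_b], threshold=0.3)

on_topic = meta("football goal match tonight")
junk = meta("buy cheap watches online now")
```

Raising the threshold makes the ensemble dismiss more junk at the cost of also dismissing more on-topic documents, which is the trade-off the abstract describes.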

PAGES

287-299

References to SciGraph publications

  • 1998. Text categorization with Support Vector Machines: Learning with many relevant features in MACHINE LEARNING: ECML-98
  • 1998-06. A Tutorial on Support Vector Machines for Pattern Recognition in DATA MINING AND KNOWLEDGE DISCOVERY
  • 1996-08. Bagging predictors in MACHINE LEARNING
Book

    TITLE

    Advances in Information Retrieval

    ISBN

    978-3-540-25295-5
    978-3-540-31865-1

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21

    DOI

    http://dx.doi.org/10.1007/978-3-540-31865-1_21

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1033682131



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "name": [
                "Max-Planck-Institute for Computer Science, Germany"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Siersdorfer", 
            "givenName": "Stefan", 
            "id": "sg:person.011411555201.43", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011411555201.43"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "name": [
                "Max-Planck-Institute for Computer Science, Germany"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Weikum", 
            "givenName": "Gerhard", 
            "id": "sg:person.010663162237.83", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010663162237.83"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/bf00058655", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1002929950", 
              "https://doi.org/10.1007/bf00058655"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1108/eb026637", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009667911"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1008992.1009032", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1012056956"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/345508.345593", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020019156"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/307400.307419", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020119095"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/s0893-6080(05)80023-1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020902633"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1031171.1031184", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1023970965"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/288627.288651", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1024388005"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1023/a:1009715923555", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1042048349", 
              "https://doi.org/10.1023/a:1009715923555"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/956750.956778", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1049128124"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/bfb0026683", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1051853845", 
              "https://doi.org/10.1007/bfb0026683"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/sfcs.1989.63487", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1086226015"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/icdm.2002.1183999", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1093489185"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.3115/112405.112471", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1099203929"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2005", 
        "datePublishedReg": "2005-01-01", 
        "description": "This paper addresses the problem of performing supervised classification on document collections containing also junk documents. With \u201djunk documents\u201d we mean documents that do not belong to the topic categories (classes) we are interested in. This type of documents can typically not be covered by the training set; nevertheless in many real world applications (e.g. classification of web or intranet content, focused crawling etc.) such documents occur quite often and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.", 
        "editor": [
          {
            "familyName": "Losada", 
            "givenName": "David E.", 
            "type": "Person"
          }, 
          {
            "familyName": "Fern\u00e1ndez-Luna", 
            "givenName": "Juan M.", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-3-540-31865-1_21", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": true, 
        "isPartOf": {
          "isbn": [
            "978-3-540-25295-5", 
            "978-3-540-31865-1"
          ], 
          "name": "Advances in Information Retrieval", 
          "type": "Book"
        }, 
        "name": "Using Restrictive Classification and Meta Classification for Junk Elimination", 
        "pagination": "287-299", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1033682131"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-3-540-31865-1_21"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "a0737eb3b21b957c3843eef0ab8ae9702108942cb70425ff8dcaa1d2d76e3b23"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-3-540-31865-1_21", 
          "https://app.dimensions.ai/details/publication/pub.1033682131"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T07:59", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000359_0000000359/records_29182_00000001.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F978-3-540-31865-1_21"
      }
    ]
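Because JSON-LD is plain JSON, a record like the one above can be read with a standard JSON parser. The following is a minimal sketch, assuming the record is a one-element array shaped like the one shown here; the helper name `summarize` and the trimmed `SAMPLE` string are illustrative and not part of SciGraph.

```python
import json

# Trimmed copy of the record above, kept small for illustration.
SAMPLE = """
[
  {
    "name": "Using Restrictive Classification and Meta Classification for Junk Elimination",
    "author": [
      {"familyName": "Siersdorfer", "givenName": "Stefan"},
      {"familyName": "Weikum", "givenName": "Gerhard"}
    ],
    "citation": [
      {"id": "sg:pub.10.1007/bf00058655"},
      {"id": "sg:pub.10.1007/bfb0026683"}
    ],
    "pagination": "287-299",
    "productId": [
      {"name": "doi", "value": ["10.1007/978-3-540-31865-1_21"]}
    ]
  }
]
"""


def summarize(record_json: str) -> dict:
    """Pull a few human-readable fields out of a SciGraph-style JSON-LD record."""
    record = json.loads(record_json)[0]
    # productId is a list of {name, value} pairs; index it by name.
    product_ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    return {
        "title": record["name"],
        "authors": [f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])],
        "doi": product_ids.get("doi"),
        "pages": record.get("pagination"),
        # A set drops any repeated citation ids before sorting.
        "cited_ids": sorted({c["id"] for c in record.get("citation", [])}),
    }


summary = summarize(SAMPLE)
```

This treats the record as ordinary JSON and ignores the `@context`; a full JSON-LD processor would be needed to expand the short property names into IRIs.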
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'
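The four curl calls differ only in their `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The helper names `build_request` and `fetch` and the short format keys are assumptions made for this example, and the endpoint may no longer respond, so the sketch separates building the request from sending it.

```python
import urllib.request

BASE = "https://scigraph.springernature.com/"

# Media types taken from the curl examples above; the dict keys are illustrative.
ACCEPT = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(record_id: str, fmt: str) -> urllib.request.Request:
    """Build a GET request that asks for the record in the given RDF serialization."""
    return urllib.request.Request(BASE + record_id, headers={"Accept": ACCEPT[fmt]})


def fetch(record_id: str, fmt: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (requires network access)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


request = build_request("pub.10.1007/978-3-540-31865-1_21", "turtle")
```

Calling `fetch("pub.10.1007/978-3-540-31865-1_21", "json-ld")` would correspond to the first curl command, provided the service is reachable.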


     

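    This table displays all metadata directly associated with this object as RDF triples. Rows are numbered; a row that omits the subject or the predicate inherits it from the nearest row above.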

    123 TRIPLES      23 PREDICATES      41 URIs      20 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/978-3-540-31865-1_21 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N8f31072120014ca48e87dc64f3ae30a0
    4 schema:citation sg:pub.10.1007/bf00058655
    5 sg:pub.10.1007/bfb0026683
    6 sg:pub.10.1023/a:1009715923555
    7 https://doi.org/10.1016/s0893-6080(05)80023-1
    8 https://doi.org/10.1108/eb026637
    9 https://doi.org/10.1109/icdm.2002.1183999
    10 https://doi.org/10.1109/sfcs.1989.63487
    11 https://doi.org/10.1145/1008992.1009032
    12 https://doi.org/10.1145/1031171.1031184
    13 https://doi.org/10.1145/288627.288651
    14 https://doi.org/10.1145/307400.307419
    15 https://doi.org/10.1145/345508.345593
    16 https://doi.org/10.1145/956750.956778
    17 https://doi.org/10.3115/112405.112471
    18 schema:datePublished 2005
    19 schema:datePublishedReg 2005-01-01
    20 schema:description This paper addresses the problem of performing supervised classification on document collections containing also junk documents. With ”junk documents” we mean documents that do not belong to the topic categories (classes) we are interested in. This type of documents can typically not be covered by the training set; nevertheless in many real world applications (e.g. classification of web or intranet content, focused crawling etc.) such documents occur quite often and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.
    21 schema:editor Nd0e23296b37c44c491c431d3fe4fdd80
    22 schema:genre chapter
    23 schema:inLanguage en
    24 schema:isAccessibleForFree true
    25 schema:isPartOf N1bc68e93c52b4472a6e5fab604fb8ade
    26 schema:name Using Restrictive Classification and Meta Classification for Junk Elimination
    27 schema:pagination 287-299
    28 schema:productId N345a5298159d485b8d1f4ba17d91278d
    29 N98622d596ae44ed085929b47c878f1c8
    30 Nb264c61c7b8c4551b0e7afbd82f9287c
    31 schema:publisher Nf07824202e0f42e39c5718456c3d00f1
    32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1033682131
    33 https://doi.org/10.1007/978-3-540-31865-1_21
    34 schema:sdDatePublished 2019-04-16T07:59
    35 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    36 schema:sdPublisher Nc053a615fa8545dea31fb8116da6f2ba
    37 schema:url https://link.springer.com/10.1007%2F978-3-540-31865-1_21
    38 sgo:license sg:explorer/license/
    39 sgo:sdDataset chapters
    40 rdf:type schema:Chapter
    41 N19000fb389f44f5cac895f589631edc1 rdf:first N276a932d5b9b4e79b765266589aa77c0
    42 rdf:rest rdf:nil
    43 N1bc68e93c52b4472a6e5fab604fb8ade schema:isbn 978-3-540-25295-5
    44 978-3-540-31865-1
    45 schema:name Advances in Information Retrieval
    46 rdf:type schema:Book
    47 N276a932d5b9b4e79b765266589aa77c0 schema:familyName Fernández-Luna
    48 schema:givenName Juan M.
    49 rdf:type schema:Person
    50 N32922bc6850548a4810aabda6cc61c5c schema:name Max-Planck-Institute for Computer Science, Germany
    51 rdf:type schema:Organization
    52 N345a5298159d485b8d1f4ba17d91278d schema:name doi
    53 schema:value 10.1007/978-3-540-31865-1_21
    54 rdf:type schema:PropertyValue
    55 N6bb38f5461764dc2a8f7b43c2bfa77fe schema:familyName Losada
    56 schema:givenName David E.
    57 rdf:type schema:Person
    58 N8f31072120014ca48e87dc64f3ae30a0 rdf:first sg:person.011411555201.43
    59 rdf:rest Ndc95fc89cb56447db402337df9d5d5d6
    60 N98622d596ae44ed085929b47c878f1c8 schema:name dimensions_id
    61 schema:value pub.1033682131
    62 rdf:type schema:PropertyValue
    63 Nb264c61c7b8c4551b0e7afbd82f9287c schema:name readcube_id
    64 schema:value a0737eb3b21b957c3843eef0ab8ae9702108942cb70425ff8dcaa1d2d76e3b23
    65 rdf:type schema:PropertyValue
    66 Nbe82f8b994b04dd4b85f6deb695fadd7 schema:name Max-Planck-Institute for Computer Science, Germany
    67 rdf:type schema:Organization
    68 Nc053a615fa8545dea31fb8116da6f2ba schema:name Springer Nature - SN SciGraph project
    69 rdf:type schema:Organization
    70 Nd0e23296b37c44c491c431d3fe4fdd80 rdf:first N6bb38f5461764dc2a8f7b43c2bfa77fe
    71 rdf:rest N19000fb389f44f5cac895f589631edc1
    72 Ndc95fc89cb56447db402337df9d5d5d6 rdf:first sg:person.010663162237.83
    73 rdf:rest rdf:nil
    74 Nf07824202e0f42e39c5718456c3d00f1 schema:location Berlin, Heidelberg
    75 schema:name Springer Berlin Heidelberg
    76 rdf:type schema:Organisation
    77 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    78 schema:name Information and Computing Sciences
    79 rdf:type schema:DefinedTerm
    80 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    81 schema:name Artificial Intelligence and Image Processing
    82 rdf:type schema:DefinedTerm
    83 sg:person.010663162237.83 schema:affiliation N32922bc6850548a4810aabda6cc61c5c
    84 schema:familyName Weikum
    85 schema:givenName Gerhard
    86 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010663162237.83
    87 rdf:type schema:Person
    88 sg:person.011411555201.43 schema:affiliation Nbe82f8b994b04dd4b85f6deb695fadd7
    89 schema:familyName Siersdorfer
    90 schema:givenName Stefan
    91 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011411555201.43
    92 rdf:type schema:Person
    93 sg:pub.10.1007/bf00058655 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002929950
    94 https://doi.org/10.1007/bf00058655
    95 rdf:type schema:CreativeWork
    96 sg:pub.10.1007/bfb0026683 schema:sameAs https://app.dimensions.ai/details/publication/pub.1051853845
    97 https://doi.org/10.1007/bfb0026683
    98 rdf:type schema:CreativeWork
    99 sg:pub.10.1023/a:1009715923555 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042048349
    100 https://doi.org/10.1023/a:1009715923555
    101 rdf:type schema:CreativeWork
    102 https://doi.org/10.1016/s0893-6080(05)80023-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020902633
    103 rdf:type schema:CreativeWork
    104 https://doi.org/10.1108/eb026637 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009667911
    105 rdf:type schema:CreativeWork
    106 https://doi.org/10.1109/icdm.2002.1183999 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093489185
    107 rdf:type schema:CreativeWork
    108 https://doi.org/10.1109/sfcs.1989.63487 schema:sameAs https://app.dimensions.ai/details/publication/pub.1086226015
    109 rdf:type schema:CreativeWork
    110 https://doi.org/10.1145/1008992.1009032 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012056956
    111 rdf:type schema:CreativeWork
    112 https://doi.org/10.1145/1031171.1031184 schema:sameAs https://app.dimensions.ai/details/publication/pub.1023970965
    113 rdf:type schema:CreativeWork
    114 https://doi.org/10.1145/288627.288651 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024388005
    115 rdf:type schema:CreativeWork
    116 https://doi.org/10.1145/307400.307419 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020119095
    117 rdf:type schema:CreativeWork
    118 https://doi.org/10.1145/345508.345593 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020019156
    119 rdf:type schema:CreativeWork
    120 https://doi.org/10.1145/956750.956778 schema:sameAs https://app.dimensions.ai/details/publication/pub.1049128124
    121 rdf:type schema:CreativeWork
    122 https://doi.org/10.3115/112405.112471 schema:sameAs https://app.dimensions.ai/details/publication/pub.1099203929
    123 rdf:type schema:CreativeWork
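The collapsed rows can be expanded back into full subject-predicate-object triples by carrying the last seen subject and predicate forward. The sketch below is a heuristic written for this listing, not a SciGraph tool: it assumes every predicate starts with `schema:`, `rdf:`, or `sgo:`, that each row begins with its row number, and that an object-only row never has a predicate-prefixed second token.

```python
from typing import List, Optional, Tuple

# Assumption: these prefixes identify predicates in this particular table.
PREDICATE_PREFIXES = ("schema:", "rdf:", "sgo:")

Triple = Tuple[Optional[str], Optional[str], str]


def expand_rows(rows: List[str]) -> List[Triple]:
    """Fill in omitted subjects and predicates from the preceding rows."""
    triples: List[Triple] = []
    subject: Optional[str] = None
    predicate: Optional[str] = None
    for row in rows:
        parts = row.strip().split(" ", 1)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank lines and the header row
        body = parts[1]
        tokens = body.split(" ", 2)
        if len(tokens) >= 2 and tokens[0].startswith(PREDICATE_PREFIXES):
            # "predicate object": the subject is inherited.
            predicate = tokens[0]
            obj = body.split(" ", 1)[1]
        elif len(tokens) == 3 and tokens[1].startswith(PREDICATE_PREFIXES):
            # "subject predicate object": a fully specified row.
            subject, predicate, obj = tokens
        else:
            # "object" only: subject and predicate are both inherited.
            obj = body
        triples.append((subject, predicate, obj))
    return triples


# Rows 43-46 of the table above.
book_rows = [
    "43 N1bc68e93c52b4472a6e5fab604fb8ade schema:isbn 978-3-540-25295-5",
    "44 978-3-540-31865-1",
    "45 schema:name Advances in Information Retrieval",
    "46 rdf:type schema:Book",
]
book_triples = expand_rows(book_rows)
```

For exact results, one of the machine-readable serializations (N-Triples in particular, which has one complete triple per line) is a safer source than this display table.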
     



