Ontology type: schema:Chapter
Open Access: True
2005
AUTHORS: Stefan Siersdorfer, Gerhard Weikum
ABSTRACT: This paper addresses the problem of performing supervised classification on document collections that also contain junk documents. By "junk documents" we mean documents that do not belong to the topic categories (classes) we are interested in. Documents of this type typically cannot be covered by the training set; nevertheless, in many real-world applications (e.g., classification of web or intranet content, focused crawling) such documents occur quite often, and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assign them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.
PAGES: 287-299
Advances in Information Retrieval
ISBN: 978-3-540-25295-5, 978-3-540-31865-1
http://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21
DOI: http://dx.doi.org/10.1007/978-3-540-31865-1_21
DIMENSIONS: https://app.dimensions.ai/details/publication/pub.1033682131
JSON-LD is the canonical representation for SciGraph data.
[
{
"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json",
"about": [
{
"id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801",
"inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/",
"name": "Artificial Intelligence and Image Processing",
"type": "DefinedTerm"
},
{
"id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08",
"inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/",
"name": "Information and Computing Sciences",
"type": "DefinedTerm"
}
],
"author": [
{
"affiliation": {
"name": [
"Max-Planck-Institute for Computer Science, Germany"
],
"type": "Organization"
},
"familyName": "Siersdorfer",
"givenName": "Stefan",
"id": "sg:person.011411555201.43",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011411555201.43"
],
"type": "Person"
},
{
"affiliation": {
"name": [
"Max-Planck-Institute for Computer Science, Germany"
],
"type": "Organization"
},
"familyName": "Weikum",
"givenName": "Gerhard",
"id": "sg:person.010663162237.83",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010663162237.83"
],
"type": "Person"
}
],
"citation": [
{
"id": "sg:pub.10.1007/bf00058655",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1002929950",
"https://doi.org/10.1007/bf00058655"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1108/eb026637",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1009667911"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1145/1008992.1009032",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1012056956"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1145/345508.345593",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1020019156"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1145/307400.307419",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1020119095"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1016/s0893-6080(05)80023-1",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1020902633"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1145/1031171.1031184",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1023970965"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1145/288627.288651",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1024388005"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1023/a:1009715923555",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1042048349",
"https://doi.org/10.1023/a:1009715923555"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1145/956750.956778",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1049128124"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1007/bfb0026683",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1051853845",
"https://doi.org/10.1007/bfb0026683"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1109/sfcs.1989.63487",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1086226015"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.1109/icdm.2002.1183999",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1093489185"
],
"type": "CreativeWork"
},
{
"id": "https://doi.org/10.3115/112405.112471",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1099203929"
],
"type": "CreativeWork"
}
],
"datePublished": "2005",
"datePublishedReg": "2005-01-01",
"description": "This paper addresses the problem of performing supervised classification on document collections containing also junk documents. With \u201djunk documents\u201d we mean documents that do not belong to the topic categories (classes) we are interested in. This type of documents can typically not be covered by the training set; nevertheless in many real world applications (e.g. classification of web or intranet content, focused crawling etc.) such documents occur quite often and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.",
"editor": [
{
"familyName": "Losada",
"givenName": "David E.",
"type": "Person"
},
{
"familyName": "Fern\u00e1ndez-Luna",
"givenName": "Juan M.",
"type": "Person"
}
],
"genre": "chapter",
"id": "sg:pub.10.1007/978-3-540-31865-1_21",
"inLanguage": [
"en"
],
"isAccessibleForFree": true,
"isPartOf": {
"isbn": [
"978-3-540-25295-5",
"978-3-540-31865-1"
],
"name": "Advances in Information Retrieval",
"type": "Book"
},
"name": "Using Restrictive Classification and Meta Classification for Junk Elimination",
"pagination": "287-299",
"productId": [
{
"name": "dimensions_id",
"type": "PropertyValue",
"value": [
"pub.1033682131"
]
},
{
"name": "doi",
"type": "PropertyValue",
"value": [
"10.1007/978-3-540-31865-1_21"
]
},
{
"name": "readcube_id",
"type": "PropertyValue",
"value": [
"a0737eb3b21b957c3843eef0ab8ae9702108942cb70425ff8dcaa1d2d76e3b23"
]
}
],
"publisher": {
"location": "Berlin, Heidelberg",
"name": "Springer Berlin Heidelberg",
"type": "Organisation"
},
"sameAs": [
"https://doi.org/10.1007/978-3-540-31865-1_21",
"https://app.dimensions.ai/details/publication/pub.1033682131"
],
"sdDataset": "chapters",
"sdDatePublished": "2019-04-16T07:59",
"sdLicense": "https://scigraph.springernature.com/explorer/license/",
"sdPublisher": {
"name": "Springer Nature - SN SciGraph project",
"type": "Organization"
},
"sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000359_0000000359/records_29182_00000001.jsonl",
"type": "Chapter",
"url": "https://link.springer.com/10.1007%2F978-3-540-31865-1_21"
}
]
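Because the record is plain JSON, it can be read with any JSON parser. The following is a minimal sketch, not part of the SciGraph tooling: it embeds a trimmed excerpt of the record (only the fields it uses) and pulls out the title, author names, pages, and DOI. The `summarize` helper and the `RECORD_JSON` name are hypothetical and introduced only for illustration.

```python
import json

# Trimmed excerpt of the record; only the fields used below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Siersdorfer", "givenName": "Stefan", "type": "Person"},
      {"familyName": "Weikum", "givenName": "Gerhard", "type": "Person"}
    ],
    "name": "Using Restrictive Classification and Meta Classification for Junk Elimination",
    "pagination": "287-299",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1033682131"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/978-3-540-31865-1_21"]}
    ]
  }
]
"""


def summarize(record):
    """Return title, author names, pages and DOI from one record dict (hypothetical helper)."""
    authors = [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]
    # productId is a list of name/value pairs; pick the first value of the entry named "doi".
    doi = next(
        (p["value"][0] for p in record.get("productId", []) if p.get("name") == "doi"),
        None,
    )
    return {
        "title": record.get("name"),
        "authors": authors,
        "pages": record.get("pagination"),
        "doi": doi,
    }


# The top-level JSON value is a list holding a single record object.
record = json.loads(RECORD_JSON)[0]
summary = summarize(record)
print(summary)
```

The sketch treats the record as ordinary JSON and ignores the `@context`; a JSON-LD processor would be needed to expand the keys into full RDF terms.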
Download the RDF metadata as JSON-LD, N-Triples, Turtle, or RDF/XML.
JSON-LD is a popular format for linked data which is fully compatible with JSON.
curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'
N-Triples is a line-based linked data format ideal for batch operations.
curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'
Turtle is a human-readable linked data format.
curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'
RDF/XML is a standard XML format for linked data.
curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21'
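The four curl commands differ only in the `Accept` header, so the same content negotiation can be expressed once in a script. Below is a minimal sketch using Python's standard `urllib.request`; the `ACCEPT` table and `build_request` helper are hypothetical names, and the short format keys (`json-ld`, `nt`, `turtle`, `xml`) are simply labels chosen here. The request is only constructed, not sent, since whether the endpoint still responds is an assumption.

```python
from urllib.request import Request

# Record URL taken from the curl commands.
BASE = "https://scigraph.springernature.com/pub.10.1007/978-3-540-31865-1_21"

# Short format label -> MIME type used in the Accept header.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE):
    """Build (but do not send) a GET request asking for the given RDF format."""
    if fmt not in ACCEPT:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(ACCEPT)}")
    return Request(url, headers={"Accept": ACCEPT[fmt]})


req = build_request("turtle")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual download; that step is left out so the sketch runs without network access.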
Metadata directly associated with this object, expressed as RDF triples: 123 triples, 23 predicates, 41 URIs, 20 literals, 8 blank nodes.