How Effective is Stemming and Decompounding for German Text Retrieval?


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2004-09

AUTHORS

Martin Braschler, Bärbel Ripplinger

ABSTRACT

Information retrieval systems operating on free text face difficulties when word forms used in the query and documents do not match. The usual solution is the use of a “stemming component” that reduces related word forms to a common stem. Extensive studies of such components exist for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. The major contribution of our work that goes beyond its focus on German lies in the investigation of a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.
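
To make the two operations named in the abstract concrete, here is a minimal Python sketch of suffix-stripping stemming and lexicon-based decompounding. It is a toy illustration, not the authors' method: the suffix list, the tiny lexicon, and the function names `stem` and `decompound` are assumptions invented for this example, and real systems use far richer resources.

```python
# Toy illustration only. The suffix list and lexicon are invented for this
# example; they are not the resources evaluated in the article.
SUFFIXES = ("ern", "en", "er", "es", "e", "n", "s")  # tried in this order
LEXICON = {"fussball", "welt", "meisterschaft", "arbeit", "amt"}


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


def decompound(word, lexicon=LEXICON, min_part=3):
    """Split a compound into lexicon words; return [word] if no split is found."""
    def split(rest):
        if rest in lexicon:
            return [rest]
        # Try the longest possible head first, then shorter ones.
        for i in range(len(rest) - min_part, min_part - 1, -1):
            head, tail = rest[:i], rest[i:]
            if head not in lexicon:
                continue
            # The tail may start with a linking "s" (as in "Arbeit-s-amt").
            for candidate in (tail, tail[1:] if tail.startswith("s") else None):
                if candidate and len(candidate) >= min_part:
                    parts = split(candidate)
                    if parts:
                        return [head] + parts
        return None

    w = word.lower()
    return split(w) or [w]


if __name__ == "__main__":
    print(stem("Kinder"), stem("Kindes"))           # both reduce to the same stem
    print(decompound("Fussballweltmeisterschaft"))  # three lexicon parts
    print(decompound("Arbeitsamt"))                 # linking "s" is dropped
```

With this sketch, a query containing "Kindes" and a document containing "Kinder" would share an index term, and a query for "Weltmeisterschaft" parts could match a document that only contains the long compound.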

PAGES

291-316

References to SciGraph publications

  • 2002. Stemming Evaluated in 6 Languages by Hummingbird SearchServer™ at CLEF 2001 in EVALUATION OF CROSS-LANGUAGE INFORMATION RETRIEVAL SYSTEMS
  • 2002. Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German, and Italian in EVALUATION OF CROSS-LANGUAGE INFORMATION RETRIEVAL SYSTEMS
  • 2001. Experiments with the Eurospider Retrieval System for CLEF 2000 in CROSS-LANGUAGE INFORMATION RETRIEVAL AND EVALUATION

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1023/b:inrt.0000011208.60754.a1

    DOI

    http://dx.doi.org/10.1023/b:inrt.0000011208.60754.a1

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1009370025



    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/2004", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Linguistics", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/20", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Language, Communication and Culture", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Neuch\u00e2tel", 
              "id": "https://www.grid.ac/institutes/grid.10711.36", 
              "name": [
                "Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006, Z\u00fcrich", 
                "Institut Interfacultaire d'Informatique, Switzerland; Universit\u00e9 de Neuch\u00e2tel, Pierre-\u00e0-Mazel 7, CH-2001, Neuch\u00e2tel, Switzerland"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Braschler", 
            "givenName": "Martin", 
            "id": "sg:person.015363630667.99", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015363630667.99"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Eurospider Information Technology (Switzerland)", 
              "id": "https://www.grid.ac/institutes/grid.433769.c", 
              "name": [
                "Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006, Z\u00fcrich, Switzerland"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Ripplinger", 
            "givenName": "B\u00e4rbel", 
            "id": "sg:person.0765000564.12", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0765000564.12"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(199601)47:1<70::aid-asi7>3.0.co;2-#", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1006353423"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1162/089120101750300490", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1008516626"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/s0306-4573(02)00018-3", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009347350"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/243199.243209", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1010732909"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1018240749"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45691-0_24", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020330513", 
              "https://doi.org/10.1007/3-540-45691-0_24"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/160688.160758", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1021980312"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1024181092"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45691-0_25", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1026635841", 
              "https://doi.org/10.1007/3-540-45691-0_25"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/243199.243206", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1032416718"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(199101)42:1<7::aid-asi2>3.0.co;2-p", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1034971043"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1108/eb046814", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037275209"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-44645-1_13", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1047282467", 
              "https://doi.org/10.1007/3-540-44645-1_13"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/243199.243213", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1047493755"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2004-09", 
        "datePublishedReg": "2004-09-01", 
        "description": "Information retrieval systems operating on free text face difficulties when word forms used in the query and documents do not match. The usual solution is the use of a \u201cstemming component\u201d that reduces related word forms to a common stem. Extensive studies of such components exist for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. The major contribution of our work that goes beyond its focus on German lies in the investigation of a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.", 
        "genre": "research_article", 
        "id": "sg:pub.10.1023/b:inrt.0000011208.60754.a1", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": true, 
        "isPartOf": [
          {
            "id": "sg:journal.1023664", 
            "issn": [
              "1386-4564", 
              "1573-7659"
            ], 
            "name": "Information Retrieval Journal", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "3-4", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "7"
          }
        ], 
        "name": "How Effective is Stemming and Decompounding for German Text Retrieval?", 
        "pagination": "291-316", 
        "productId": [
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "2ca2ef5cc6a06899cd270226edd2c9392c2c9d2d2867d34a2526f3d31caeddf3"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1023/b:inrt.0000011208.60754.a1"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1009370025"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1023/b:inrt.0000011208.60754.a1", 
          "https://app.dimensions.ai/details/publication/pub.1009370025"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2019-04-10T14:15", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8660_00000536.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "http://link.springer.com/10.1023%2FB%3AINRT.0000011208.60754.a1"
      }
    ]
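
    Because the record is plain JSON, it can be read with any JSON parser. The Python sketch below pulls a short citation summary out of an excerpt of the record above. The excerpt keeps only a few keys, and the function name `summarize` and the output keys are assumptions made for this example, not part of SciGraph.

```python
import json

# Excerpt of the record above, reduced to the keys used in this sketch.
EXCERPT = r"""
[
  {
    "author": [
      {"familyName": "Braschler", "givenName": "Martin", "type": "Person"},
      {"familyName": "Ripplinger", "givenName": "B\u00e4rbel", "type": "Person"}
    ],
    "datePublished": "2004-09",
    "isPartOf": [
      {"name": "Information Retrieval Journal", "type": "Periodical"},
      {"issueNumber": "3-4", "type": "PublicationIssue"},
      {"type": "PublicationVolume", "volumeNumber": "7"}
    ],
    "name": "How Effective is Stemming and Decompounding for German Text Retrieval?",
    "pagination": "291-316"
  }
]
"""


def summarize(records):
    """Return a flat summary of the first article in a SciGraph-style record list."""
    article = records[0]
    # Index the isPartOf entries by their "type" so each can be looked up directly.
    parts = {p["type"]: p for p in article.get("isPartOf", [])}
    return {
        "title": article["name"],
        "authors": [f'{a["givenName"]} {a["familyName"]}' for a in article["author"]],
        "journal": parts.get("Periodical", {}).get("name"),
        "volume": parts.get("PublicationVolume", {}).get("volumeNumber"),
        "issue": parts.get("PublicationIssue", {}).get("issueNumber"),
        "pages": article.get("pagination"),
        "date": article.get("datePublished"),
    }


if __name__ == "__main__":
    for key, value in summarize(json.loads(EXCERPT)).items():
        print(f"{key}: {value}")
```

    Indexing `isPartOf` by `type` avoids relying on the order of the journal, issue, and volume entries, which the record does not guarantee.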
     

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/b:inrt.0000011208.60754.a1'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/b:inrt.0000011208.60754.a1'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/b:inrt.0000011208.60754.a1'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/b:inrt.0000011208.60754.a1'
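
    The same content negotiation can be sketched in Python with the standard library. The four media types are the ones used in the curl commands above; the short format keys (`json-ld`, `nt`, `turtle`, `rdf-xml`) and the function names are assumptions made for this example, and the endpoint may no longer respond as it did when this page was captured.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl examples; the dictionary keys are
# hypothetical labels chosen for this sketch.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(record_id, fmt):
    """Build a GET request that asks for the record in the given format."""
    return urllib.request.Request(BASE_URL + record_id, headers={"Accept": ACCEPT[fmt]})


def fetch(record_id, fmt, timeout=10):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    req = build_request("pub.10.1023/b:inrt.0000011208.60754.a1", "turtle")
    print(req.get_method(), req.full_url, req.get_header("Accept"))
```

    Building the request separately from sending it keeps the header logic checkable without a network connection.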


     

    This table displays all metadata directly associated to this object as RDF triples.

    117 TRIPLES      21 PREDICATES      41 URIs      19 LITERALS      7 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1023/b:inrt.0000011208.60754.a1 schema:about anzsrc-for:20
    2 anzsrc-for:2004
    3 schema:author Ndeeea4b4f4324b849d9a63cf52fa185c
    4 schema:citation sg:pub.10.1007/3-540-44645-1_13
    5 sg:pub.10.1007/3-540-45691-0_24
    6 sg:pub.10.1007/3-540-45691-0_25
    7 https://doi.org/10.1002/(sici)1097-4571(199101)42:1<7::aid-asi2>3.0.co;2-p
    8 https://doi.org/10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l
    9 https://doi.org/10.1002/(sici)1097-4571(199601)47:1<70::aid-asi7>3.0.co;2-#
    10 https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q
    11 https://doi.org/10.1016/s0306-4573(02)00018-3
    12 https://doi.org/10.1108/eb046814
    13 https://doi.org/10.1145/160688.160758
    14 https://doi.org/10.1145/243199.243206
    15 https://doi.org/10.1145/243199.243209
    16 https://doi.org/10.1145/243199.243213
    17 https://doi.org/10.1162/089120101750300490
    18 schema:datePublished 2004-09
    19 schema:datePublishedReg 2004-09-01
    20 schema:description Information retrieval systems operating on free text face difficulties when word forms used in the query and documents do not match. The usual solution is the use of a “stemming component” that reduces related word forms to a common stem. Extensive studies of such components exist for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. The major contribution of our work that goes beyond its focus on German lies in the investigation of a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.
    21 schema:genre research_article
    22 schema:inLanguage en
    23 schema:isAccessibleForFree true
    24 schema:isPartOf N48768aa6c2d644f29731a23201133232
    25 N659806e36d844b1fac64677d75217f2f
    26 sg:journal.1023664
    27 schema:name How Effective is Stemming and Decompounding for German Text Retrieval?
    28 schema:pagination 291-316
    29 schema:productId N1bed79a18130412fb064c469a348b5fe
    30 N8a587165e9a14b33ab5b86ed46afa9f4
    31 Ndae4a30a159048539fd7cd04b4f3fc83
    32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009370025
    33 https://doi.org/10.1023/b:inrt.0000011208.60754.a1
    34 schema:sdDatePublished 2019-04-10T14:15
    35 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    36 schema:sdPublisher N5b9d5d0237ca4d81a15d04d145d0b16b
    37 schema:url http://link.springer.com/10.1023%2FB%3AINRT.0000011208.60754.a1
    38 sgo:license sg:explorer/license/
    39 sgo:sdDataset articles
    40 rdf:type schema:ScholarlyArticle
    41 N1bed79a18130412fb064c469a348b5fe schema:name dimensions_id
    42 schema:value pub.1009370025
    43 rdf:type schema:PropertyValue
    44 N48768aa6c2d644f29731a23201133232 schema:issueNumber 3-4
    45 rdf:type schema:PublicationIssue
    46 N5b9d5d0237ca4d81a15d04d145d0b16b schema:name Springer Nature - SN SciGraph project
    47 rdf:type schema:Organization
    48 N659806e36d844b1fac64677d75217f2f schema:volumeNumber 7
    49 rdf:type schema:PublicationVolume
    50 N8a587165e9a14b33ab5b86ed46afa9f4 schema:name readcube_id
    51 schema:value 2ca2ef5cc6a06899cd270226edd2c9392c2c9d2d2867d34a2526f3d31caeddf3
    52 rdf:type schema:PropertyValue
    53 Naa60b3032a9c4d16ad8cb3c5c4b0a022 rdf:first sg:person.0765000564.12
    54 rdf:rest rdf:nil
    55 Ndae4a30a159048539fd7cd04b4f3fc83 schema:name doi
    56 schema:value 10.1023/b:inrt.0000011208.60754.a1
    57 rdf:type schema:PropertyValue
    58 Ndeeea4b4f4324b849d9a63cf52fa185c rdf:first sg:person.015363630667.99
    59 rdf:rest Naa60b3032a9c4d16ad8cb3c5c4b0a022
    60 anzsrc-for:20 schema:inDefinedTermSet anzsrc-for:
    61 schema:name Language, Communication and Culture
    62 rdf:type schema:DefinedTerm
    63 anzsrc-for:2004 schema:inDefinedTermSet anzsrc-for:
    64 schema:name Linguistics
    65 rdf:type schema:DefinedTerm
    66 sg:journal.1023664 schema:issn 1386-4564
    67 1573-7659
    68 schema:name Information Retrieval Journal
    69 rdf:type schema:Periodical
    70 sg:person.015363630667.99 schema:affiliation https://www.grid.ac/institutes/grid.10711.36
    71 schema:familyName Braschler
    72 schema:givenName Martin
    73 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015363630667.99
    74 rdf:type schema:Person
    75 sg:person.0765000564.12 schema:affiliation https://www.grid.ac/institutes/grid.433769.c
    76 schema:familyName Ripplinger
    77 schema:givenName Bärbel
    78 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0765000564.12
    79 rdf:type schema:Person
    80 sg:pub.10.1007/3-540-44645-1_13 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047282467
    81 https://doi.org/10.1007/3-540-44645-1_13
    82 rdf:type schema:CreativeWork
    83 sg:pub.10.1007/3-540-45691-0_24 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020330513
    84 https://doi.org/10.1007/3-540-45691-0_24
    85 rdf:type schema:CreativeWork
    86 sg:pub.10.1007/3-540-45691-0_25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1026635841
    87 https://doi.org/10.1007/3-540-45691-0_25
    88 rdf:type schema:CreativeWork
    89 https://doi.org/10.1002/(sici)1097-4571(199101)42:1<7::aid-asi2>3.0.co;2-p schema:sameAs https://app.dimensions.ai/details/publication/pub.1034971043
    90 rdf:type schema:CreativeWork
    91 https://doi.org/10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l schema:sameAs https://app.dimensions.ai/details/publication/pub.1018240749
    92 rdf:type schema:CreativeWork
    93 https://doi.org/10.1002/(sici)1097-4571(199601)47:1<70::aid-asi7>3.0.co;2-# schema:sameAs https://app.dimensions.ai/details/publication/pub.1006353423
    94 rdf:type schema:CreativeWork
    95 https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q schema:sameAs https://app.dimensions.ai/details/publication/pub.1024181092
    96 rdf:type schema:CreativeWork
    97 https://doi.org/10.1016/s0306-4573(02)00018-3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009347350
    98 rdf:type schema:CreativeWork
    99 https://doi.org/10.1108/eb046814 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037275209
    100 rdf:type schema:CreativeWork
    101 https://doi.org/10.1145/160688.160758 schema:sameAs https://app.dimensions.ai/details/publication/pub.1021980312
    102 rdf:type schema:CreativeWork
    103 https://doi.org/10.1145/243199.243206 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032416718
    104 rdf:type schema:CreativeWork
    105 https://doi.org/10.1145/243199.243209 schema:sameAs https://app.dimensions.ai/details/publication/pub.1010732909
    106 rdf:type schema:CreativeWork
    107 https://doi.org/10.1145/243199.243213 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047493755
    108 rdf:type schema:CreativeWork
    109 https://doi.org/10.1162/089120101750300490 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008516626
    110 rdf:type schema:CreativeWork
    111 https://www.grid.ac/institutes/grid.10711.36 schema:alternateName University of Neuchâtel
    112 schema:name Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006, Zürich
    113 Institut Interfacultaire d'Informatique, Switzerland; Université de Neuchâtel, Pierre-à-Mazel 7, CH-2001, Neuchâtel, Switzerland
    114 rdf:type schema:Organization
    115 https://www.grid.ac/institutes/grid.433769.c schema:alternateName Eurospider Information Technology (Switzerland)
    116 schema:name Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006, Zürich, Switzerland
    117 rdf:type schema:Organization
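
    In the table, the author order is stored as an RDF collection: `schema:author` points at a blank node, and each node carries one `rdf:first` item and an `rdf:rest` link to the next node until `rdf:nil`. The Python sketch below walks that chain using the node identifiers from the table; the dictionary layout and the function name `walk_list` are assumptions made for this example.

```python
RDF_NIL = "rdf:nil"

# Blank nodes copied from the table; the dict-of-dicts layout is an
# illustrative choice, not a SciGraph data structure.
NODES = {
    "Ndeeea4b4f4324b849d9a63cf52fa185c": {
        "rdf:first": "sg:person.015363630667.99",
        "rdf:rest": "Naa60b3032a9c4d16ad8cb3c5c4b0a022",
    },
    "Naa60b3032a9c4d16ad8cb3c5c4b0a022": {
        "rdf:first": "sg:person.0765000564.12",
        "rdf:rest": RDF_NIL,
    },
}


def walk_list(head, nodes):
    """Follow rdf:rest links from head and collect each rdf:first item in order."""
    items, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        items.append(nodes[node]["rdf:first"])
        node = nodes[node]["rdf:rest"]
    return items


if __name__ == "__main__":
    # The head node is the object of schema:author in the table.
    print(walk_list("Ndeeea4b4f4324b849d9a63cf52fa185c", NODES))
```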
     



