Stemming and Decompounding for German Text Retrieval


Ontology type: schema:Chapter     


Chapter Info

DATE

2003

AUTHORS

Martin Braschler , Bärbel Ripplinger

ABSTRACT

The stemming problem, i.e. finding a common stem for different forms of a term, has been extensively studied for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Rarely do studies on stemming for any language cover more than one or two different approaches. This paper makes a major contribution that transcends its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.

PAGES

177-192

References to SciGraph publications

  • 2002. Stemming Evaluated in 6 Languages by Hummingbird SearchServer™ at CLEF 2001 in EVALUATION OF CROSS-LANGUAGE INFORMATION RETRIEVAL SYSTEMS
  • 2002. Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German, and Italian in EVALUATION OF CROSS-LANGUAGE INFORMATION RETRIEVAL SYSTEMS
  • 2001. West Group at CLEF 2000: Non-english Monolingual Retrieval in CROSS-LANGUAGE INFORMATION RETRIEVAL AND EVALUATION
  • 2001. Experiments with the Eurospider Retrieval System for CLEF 2000 in CROSS-LANGUAGE INFORMATION RETRIEVAL AND EVALUATION
  • 2002. Mpro-IR in CLEF 2001 in EVALUATION OF CROSS-LANGUAGE INFORMATION RETRIEVAL SYSTEMS

Book

    TITLE

    Advances in Information Retrieval

    ISBN

    978-3-540-01274-0
    978-3-540-36618-8

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/3-540-36618-0_13

    DOI

    http://dx.doi.org/10.1007/3-540-36618-0_13

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1031488568



    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/2004", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Linguistics", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/20", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Language, Communication and Culture", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Eurospider Information Technology (Switzerland)", 
              "id": "https://www.grid.ac/institutes/grid.433769.c", 
              "name": [
                "Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006\u00a0Z\u00fcrich, Switzerland"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Braschler", 
            "givenName": "Martin", 
            "id": "sg:person.015363630667.99", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015363630667.99"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Eurospider Information Technology (Switzerland)", 
              "id": "https://www.grid.ac/institutes/grid.433769.c", 
              "name": [
                "Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006\u00a0Z\u00fcrich, Switzerland"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Ripplinger", 
            "givenName": "B\u00e4rbel", 
            "id": "sg:person.0765000564.12", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0765000564.12"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/3-540-44645-1_25", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1003666217", 
              "https://doi.org/10.1007/3-540-44645-1_25"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(199601)47:1<70::aid-asi7>3.0.co;2-#", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1006353423"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1162/089120101750300490", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1008516626"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/s0306-4573(02)00018-3", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009347350"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/243199.243209", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1010732909"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1018240749"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45691-0_24", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020330513", 
              "https://doi.org/10.1007/3-540-45691-0_24"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/160688.160758", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1021980312"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1024181092"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45691-0_25", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1026635841", 
              "https://doi.org/10.1007/3-540-45691-0_25"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/0306-4573(92)90005-k", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1027935191"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/(sici)1097-4571(199101)42:1<7::aid-asi2>3.0.co;2-p", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1034971043"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1108/eb046814", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037275209"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45691-0_28", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1043771341", 
              "https://doi.org/10.1007/3-540-45691-0_28"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-44645-1_13", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1047282467", 
              "https://doi.org/10.1007/3-540-44645-1_13"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/243199.243213", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1047493755"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2003", 
        "datePublishedReg": "2003-01-01", 
        "description": "The stemming problem, i.e. finding a common stem for different forms of a term, has been extensively studied for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Rarely do studies on stemming for any language cover more than one or two different approaches. This paper makes a major contribution that transcends its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.", 
        "editor": [
          {
            "familyName": "Sebastiani", 
            "givenName": "Fabrizio", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/3-540-36618-0_13", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": {
          "isbn": [
            "978-3-540-01274-0", 
            "978-3-540-36618-8"
          ], 
          "name": "Advances in Information Retrieval", 
          "type": "Book"
        }, 
        "name": "Stemming and Decompounding for German Text Retrieval", 
        "pagination": "177-192", 
        "productId": [
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/3-540-36618-0_13"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "9601542cfd9d6ca3b1416c1a7d96a885bd6dc2884b4d49ea74b4388a848a9520"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1031488568"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/3-540-36618-0_13", 
          "https://app.dimensions.ai/details/publication/pub.1031488568"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-15T14:25", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8669_00000262.jsonl", 
        "type": "Chapter", 
        "url": "http://link.springer.com/10.1007/3-540-36618-0_13"
      }
    ]
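As an illustration of how a record like the one above can be consumed, the following minimal Python sketch loads a trimmed copy of the JSON-LD and pulls out the title, author names, DOI, and citation ids. The constant `RECORD_JSON` and the function `summarize` are hypothetical names introduced here, not part of SciGraph. The sample deliberately repeats one citation id so that the order-preserving de-duplication step is visible.

```python
import json

# Trimmed, hand-copied excerpt of the record; only the fields used below are kept.
# One citation id is repeated on purpose to exercise the de-duplication step.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Braschler", "givenName": "Martin", "type": "Person"},
      {"familyName": "Ripplinger", "givenName": "B\\u00e4rbel", "type": "Person"}
    ],
    "citation": [
      {"id": "sg:pub.10.1007/3-540-44645-1_25", "type": "CreativeWork"},
      {"id": "https://doi.org/10.1016/0306-4573(92)90005-k", "type": "CreativeWork"},
      {"id": "https://doi.org/10.1016/0306-4573(92)90005-k", "type": "CreativeWork"}
    ],
    "name": "Stemming and Decompounding for German Text Retrieval",
    "pagination": "177-192",
    "productId": [
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/3-540-36618-0_13"]}
    ]
  }
]
"""


def summarize(record_json):
    """Return a small dict of human-readable fields from a SciGraph-style record."""
    record = json.loads(record_json)[0]  # the payload is a one-element list
    authors = [f"{a['givenName']} {a['familyName']}" for a in record["author"]]
    # productId is a list of name/value pairs; pick the first value named "doi".
    doi = next(
        value
        for product in record["productId"]
        if product["name"] == "doi"
        for value in product["value"]
    )
    # dict.fromkeys keeps first-seen order while dropping repeated ids.
    citations = list(dict.fromkeys(c["id"] for c in record["citation"]))
    return {
        "title": record["name"],
        "authors": authors,
        "doi": doi,
        "pages": record["pagination"],
        "citations": citations,
    }


summary = summarize(RECORD_JSON)
```

Because JSON-LD is plain JSON, the standard `json` module is enough for this kind of field extraction; a full RDF library would only be needed to resolve the `@context` or merge the record with other graphs.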
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/3-540-36618-0_13'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/3-540-36618-0_13'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/3-540-36618-0_13'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/3-540-36618-0_13'
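The four curl calls differ only in their `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The names `MEDIA_TYPES`, `build_request`, and `fetch` are hypothetical helpers introduced for this example. Whether the endpoint still answers is not verified here, so only the request construction is exercised and `fetch` is left uncalled.

```python
import urllib.request

# Record URL taken from the curl examples above.
BASE = "https://scigraph.springernature.com/pub.10.1007/3-540-36618-0_13"

# Format label -> media type, mirroring the four Accept headers shown above.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE):
    """Build (but do not send) a GET request asking for the given RDF format."""
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(fmt, url=BASE, timeout=10):
    """Send the request and return the response body as text.

    Assumes the server is reachable and honours the Accept header.
    """
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


turtle_request = build_request("turtle")
```

Keeping request construction separate from the network call makes the header logic checkable offline; calling `fetch("turtle")` would perform the same request as the third curl command.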


     

    This table displays all metadata directly associated to this object as RDF triples.

    125 TRIPLES      23 PREDICATES      43 URIs      20 LITERALS      8 BLANK NODES

    Row Subject Predicate Object (a row that omits the subject or predicate reuses the one from the row above)
    1 sg:pub.10.1007/3-540-36618-0_13 schema:about anzsrc-for:20
    2 anzsrc-for:2004
    3 schema:author Nff9194fbe4d34e8f877c4f3af63128cc
    4 schema:citation sg:pub.10.1007/3-540-44645-1_13
    5 sg:pub.10.1007/3-540-44645-1_25
    6 sg:pub.10.1007/3-540-45691-0_24
    7 sg:pub.10.1007/3-540-45691-0_25
    8 sg:pub.10.1007/3-540-45691-0_28
    9 https://doi.org/10.1002/(sici)1097-4571(199101)42:1<7::aid-asi2>3.0.co;2-p
    10 https://doi.org/10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l
    11 https://doi.org/10.1002/(sici)1097-4571(199601)47:1<70::aid-asi7>3.0.co;2-#
    12 https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q
    13 https://doi.org/10.1016/0306-4573(92)90005-k
    14 https://doi.org/10.1016/s0306-4573(02)00018-3
    15 https://doi.org/10.1108/eb046814
    16 https://doi.org/10.1145/160688.160758
    17 https://doi.org/10.1145/243199.243209
    18 https://doi.org/10.1145/243199.243213
    19 https://doi.org/10.1162/089120101750300490
    20 schema:datePublished 2003
    21 schema:datePublishedReg 2003-01-01
    22 schema:description The stemming problem, i.e. finding a common stem for different forms of a term, has been extensively studied for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Rarely do studies on stemming for any language cover more than one or two different approaches. This paper makes a major contribution that transcends its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.
    23 schema:editor N074e6634871749e3b02165ae3b755307
    24 schema:genre chapter
    25 schema:inLanguage en
    26 schema:isAccessibleForFree false
    27 schema:isPartOf N75e9f495c9db4845b2e93c333c9d6c73
    28 schema:name Stemming and Decompounding for German Text Retrieval
    29 schema:pagination 177-192
    30 schema:productId Ne2487856c27e466e9e44347cf5e40095
    31 Ned1c6461084444258aeea5b4313098b8
    32 Nf82d5a460c4a4f1bb0afd3b789453526
    33 schema:publisher Na7bc718cf9154c13bf0f875dd749bb3b
    34 schema:sameAs https://app.dimensions.ai/details/publication/pub.1031488568
    35 https://doi.org/10.1007/3-540-36618-0_13
    36 schema:sdDatePublished 2019-04-15T14:25
    37 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    38 schema:sdPublisher N4a9ca05d3f7e4028be4ffc6ca264c3c4
    39 schema:url http://link.springer.com/10.1007/3-540-36618-0_13
    40 sgo:license sg:explorer/license/
    41 sgo:sdDataset chapters
    42 rdf:type schema:Chapter
    43 N074e6634871749e3b02165ae3b755307 rdf:first Nd94e9b4232a64a3baf5e2c4af9b066c7
    44 rdf:rest rdf:nil
    45 N0fcc074e768a4ea7ab45ac53df626588 rdf:first sg:person.0765000564.12
    46 rdf:rest rdf:nil
    47 N4a9ca05d3f7e4028be4ffc6ca264c3c4 schema:name Springer Nature - SN SciGraph project
    48 rdf:type schema:Organization
    49 N75e9f495c9db4845b2e93c333c9d6c73 schema:isbn 978-3-540-01274-0
    50 978-3-540-36618-8
    51 schema:name Advances in Information Retrieval
    52 rdf:type schema:Book
    53 Na7bc718cf9154c13bf0f875dd749bb3b schema:location Berlin, Heidelberg
    54 schema:name Springer Berlin Heidelberg
    55 rdf:type schema:Organisation
    56 Nd94e9b4232a64a3baf5e2c4af9b066c7 schema:familyName Sebastiani
    57 schema:givenName Fabrizio
    58 rdf:type schema:Person
    59 Ne2487856c27e466e9e44347cf5e40095 schema:name dimensions_id
    60 schema:value pub.1031488568
    61 rdf:type schema:PropertyValue
    62 Ned1c6461084444258aeea5b4313098b8 schema:name readcube_id
    63 schema:value 9601542cfd9d6ca3b1416c1a7d96a885bd6dc2884b4d49ea74b4388a848a9520
    64 rdf:type schema:PropertyValue
    65 Nf82d5a460c4a4f1bb0afd3b789453526 schema:name doi
    66 schema:value 10.1007/3-540-36618-0_13
    67 rdf:type schema:PropertyValue
    68 Nff9194fbe4d34e8f877c4f3af63128cc rdf:first sg:person.015363630667.99
    69 rdf:rest N0fcc074e768a4ea7ab45ac53df626588
    70 anzsrc-for:20 schema:inDefinedTermSet anzsrc-for:
    71 schema:name Language, Communication and Culture
    72 rdf:type schema:DefinedTerm
    73 anzsrc-for:2004 schema:inDefinedTermSet anzsrc-for:
    74 schema:name Linguistics
    75 rdf:type schema:DefinedTerm
    76 sg:person.015363630667.99 schema:affiliation https://www.grid.ac/institutes/grid.433769.c
    77 schema:familyName Braschler
    78 schema:givenName Martin
    79 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015363630667.99
    80 rdf:type schema:Person
    81 sg:person.0765000564.12 schema:affiliation https://www.grid.ac/institutes/grid.433769.c
    82 schema:familyName Ripplinger
    83 schema:givenName Bärbel
    84 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0765000564.12
    85 rdf:type schema:Person
    86 sg:pub.10.1007/3-540-44645-1_13 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047282467
    87 https://doi.org/10.1007/3-540-44645-1_13
    88 rdf:type schema:CreativeWork
    89 sg:pub.10.1007/3-540-44645-1_25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1003666217
    90 https://doi.org/10.1007/3-540-44645-1_25
    91 rdf:type schema:CreativeWork
    92 sg:pub.10.1007/3-540-45691-0_24 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020330513
    93 https://doi.org/10.1007/3-540-45691-0_24
    94 rdf:type schema:CreativeWork
    95 sg:pub.10.1007/3-540-45691-0_25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1026635841
    96 https://doi.org/10.1007/3-540-45691-0_25
    97 rdf:type schema:CreativeWork
    98 sg:pub.10.1007/3-540-45691-0_28 schema:sameAs https://app.dimensions.ai/details/publication/pub.1043771341
    99 https://doi.org/10.1007/3-540-45691-0_28
    100 rdf:type schema:CreativeWork
    101 https://doi.org/10.1002/(sici)1097-4571(199101)42:1<7::aid-asi2>3.0.co;2-p schema:sameAs https://app.dimensions.ai/details/publication/pub.1034971043
    102 rdf:type schema:CreativeWork
    103 https://doi.org/10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l schema:sameAs https://app.dimensions.ai/details/publication/pub.1018240749
    104 rdf:type schema:CreativeWork
    105 https://doi.org/10.1002/(sici)1097-4571(199601)47:1<70::aid-asi7>3.0.co;2-# schema:sameAs https://app.dimensions.ai/details/publication/pub.1006353423
    106 rdf:type schema:CreativeWork
    107 https://doi.org/10.1002/(sici)1097-4571(1999)50:10<944::aid-asi9>3.0.co;2-q schema:sameAs https://app.dimensions.ai/details/publication/pub.1024181092
    108 rdf:type schema:CreativeWork
    109 https://doi.org/10.1016/0306-4573(92)90005-k schema:sameAs https://app.dimensions.ai/details/publication/pub.1027935191
    110 rdf:type schema:CreativeWork
    111 https://doi.org/10.1016/s0306-4573(02)00018-3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009347350
    112 rdf:type schema:CreativeWork
    113 https://doi.org/10.1108/eb046814 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037275209
    114 rdf:type schema:CreativeWork
    115 https://doi.org/10.1145/160688.160758 schema:sameAs https://app.dimensions.ai/details/publication/pub.1021980312
    116 rdf:type schema:CreativeWork
    117 https://doi.org/10.1145/243199.243209 schema:sameAs https://app.dimensions.ai/details/publication/pub.1010732909
    118 rdf:type schema:CreativeWork
    119 https://doi.org/10.1145/243199.243213 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047493755
    120 rdf:type schema:CreativeWork
    121 https://doi.org/10.1162/089120101750300490 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008516626
    122 rdf:type schema:CreativeWork
    123 https://www.grid.ac/institutes/grid.433769.c schema:alternateName Eurospider Information Technology (Switzerland)
    124 schema:name Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006 Zürich, Switzerland
    125 rdf:type schema:Organization
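In the table, the object of `schema:author` (row 3) is a blank node rather than a person. The ordered author list is encoded as an RDF collection chained through `rdf:first` and `rdf:rest` (rows 45-46 and 68-69). A minimal Python sketch of walking that chain follows. The dictionary `TRIPLES` is a hand-copied excerpt of those rows, and `walk_list` is a hypothetical helper, not a SciGraph or RDF-library API.

```python
RDF_NIL = "rdf:nil"

# (subject, predicate) -> object, copied from rows 45-46 and 68-69 of the table.
TRIPLES = {
    ("Nff9194fbe4d34e8f877c4f3af63128cc", "rdf:first"): "sg:person.015363630667.99",
    ("Nff9194fbe4d34e8f877c4f3af63128cc", "rdf:rest"): "N0fcc074e768a4ea7ab45ac53df626588",
    ("N0fcc074e768a4ea7ab45ac53df626588", "rdf:first"): "sg:person.0765000564.12",
    ("N0fcc074e768a4ea7ab45ac53df626588", "rdf:rest"): RDF_NIL,
}

# Object of schema:author in row 3: the head node of the author list.
AUTHOR_LIST_HEAD = "Nff9194fbe4d34e8f877c4f3af63128cc"


def walk_list(head, triples):
    """Follow rdf:first / rdf:rest links from head until rdf:nil, in order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        items.append(triples[(node, "rdf:first")])
        node = triples[(node, "rdf:rest")]
    return items


author_ids = walk_list(AUTHOR_LIST_HEAD, TRIPLES)
```

The result lists Braschler's person id before Ripplinger's, matching the author order given at the top of the record. The JSON-LD form expresses the same order with a plain array.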
     



