New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps


Ontology type: schema:Chapter     


Chapter Info

DATE

2009

AUTHORS

Walaa K. Gad, Mohamed S. Kamel

ABSTRACT

Most text clustering techniques are based on the weights of words and/or phrases in the text. Such a representation is often unsatisfactory because it ignores the relationships between terms and treats them as independent features. In this paper, a new semantic similarity based model (SSBM) is proposed. The model computes semantic similarities by using WordNet as an ontology. It captures the semantic similarity between documents that contain semantically similar, but not necessarily syntactically identical, terms. The model assigns a new weight to document terms that reflects the semantic relationships between terms that co-occur literally in the document. In conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm, the model addresses ambiguity and synonymy problems that are not detected by traditional term-frequency-based text mining techniques. The proposed model is evaluated on the Reuters-21578 and 20-Newsgroups text collections. Performance is assessed in terms of the F-measure, Purity, and Entropy quality measures. The results show promising performance improvements over the traditional term-based vector space model (VSM) as well as other existing methods that include semantic similarity measures in text clustering.
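
The abstract names Purity and Entropy as clustering quality measures, but this record does not give their formulas. The following is a minimal sketch that assumes the common textbook definitions, which may differ from the exact variants used in the chapter. The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter, defaultdict


def _cluster_label_counts(clusters, labels):
    """Count how often each true label occurs inside each cluster."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(clusters, labels):
        counts[cluster_id][label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    counts = _cluster_label_counts(clusters, labels)
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted mean of per-cluster class entropy (base 2); lower is better."""
    counts = _cluster_label_counts(clusters, labels)
    total = len(labels)
    result = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((n / size) * math.log2(n / size) for n in c.values())
        result += (size / total) * h
    return result


# Toy example with made-up data: 6 documents, 2 clusters, 2 true classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["earn", "earn", "acq", "acq", "acq", "acq"]
print(purity(clusters, labels))
print(entropy(clusters, labels))
```

Under these definitions a perfect clustering has purity 1 and entropy 0, so higher purity and lower entropy both indicate better agreement with the reference classes.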

PAGES

663-677

References to SciGraph publications

  • 2003-04-30. Using Measures of Semantic Relatedness for Word Sense Disambiguation in COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING
  • 2002-02-05. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet in COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING
  • Book

    TITLE

    Machine Learning and Data Mining in Pattern Recognition

    ISBN

    978-3-642-03069-7
    978-3-642-03070-3

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50

    DOI

    http://dx.doi.org/10.1007/978-3-642-03070-3_50

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1014462304



    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information Systems", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Waterloo", 
              "id": "https://www.grid.ac/institutes/grid.46078.3d", 
              "name": [
                "Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Waterloo, Ontario, Canada"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Gad", 
            "givenName": "Walaa K.", 
            "id": "sg:person.012222465225.53", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012222465225.53"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Waterloo", 
              "id": "https://www.grid.ac/institutes/grid.46078.3d", 
              "name": [
                "Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Waterloo, Ontario, Canada"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Kamel", 
            "givenName": "Mohamed S.", 
            "id": "sg:person.01133760566.26", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01133760566.26"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.3115/981732.981751", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1000312622"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-36456-0_24", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1001746929", 
              "https://doi.org/10.1007/3-540-36456-0_24"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1281192.1281260", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009044209"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1037/0033-295x.84.4.327", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1010221410"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1002/int.20226", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1020364411"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/318723.318728", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1036455397"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1108/eb046814", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037275209"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1162/coli.2006.32.1.13", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1042269921"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45715-1_11", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1053488638", 
              "https://doi.org/10.1007/3-540-45715-1_11"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/tkde.2003.1209005", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1061661172"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/tkde.2004.58", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1061661327"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/hicss.2006.129", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1093645328"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.3115/1621445.1621458", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1099256398"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.7551/mitpress/7287.001.0001", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1110625185"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2009", 
        "datePublishedReg": "2009-01-01", 
        "description": "Most text clustering techniques are based on words and/or phrases weights in the text. Such representation is often unsatisfactory because it ignores the relationships between terms, and considers them as independent features. In this paper, a new semantic similarity based model (SSBM) is proposed. The semantic similarity based model computes semantic similarities by utilizing WordNet as an ontology. The proposed model captures the semantic similarities between documents that contain semantically similar terms but unnecessarily syntactically identical. The semantic similarity based model assigns a new weight to document terms reflecting the semantic relationships between terms that co-occur literally in the document. Our model in conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm solves ambiguity, synonymy problems that are not detected using traditional term frequency based text mining techniques. The proposed model is evaluated on the Reuters-21578 and the 20-Newsgroups text collections datasets. The performance is assessed in terms of the Fmeasure, Purity and Entropy quality measures. The obtained results show promising performance improvements compared to the traditional term based vector space model (VSM) as well as other existing methods that include semantic similarity measures in text clustering.", 
        "editor": [
          {
            "familyName": "Perner", 
            "givenName": "Petra", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-3-642-03070-3_50", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": {
          "isbn": [
            "978-3-642-03069-7", 
            "978-3-642-03070-3"
          ], 
          "name": "Machine Learning and Data Mining in Pattern Recognition", 
          "type": "Book"
        }, 
        "name": "New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps", 
        "pagination": "663-677", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1014462304"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-3-642-03070-3_50"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "44e64eb4e5bd5b68c32303b27a94ce3a4d3ce3050a427f407f48a35688a5f140"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-3-642-03070-3_50", 
          "https://app.dimensions.ai/details/publication/pub.1014462304"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T07:18", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000354_0000000354/records_11692_00000000.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F978-3-642-03070-3_50"
      }
    ]
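
Because JSON-LD is plain JSON, a record of this shape can be read with any JSON parser. The sketch below works on a trimmed copy of the record that keeps only a few fields; the helper names are illustrative, and the sample deliberately repeats one citation id to exercise the de-duplication step.

```python
import json

# Trimmed copy of the record; only the fields used below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Gad", "givenName": "Walaa K.", "type": "Person"},
      {"familyName": "Kamel", "givenName": "Mohamed S.", "type": "Person"}
    ],
    "citation": [
      {"id": "https://doi.org/10.3115/981732.981751", "type": "CreativeWork"},
      {"id": "sg:pub.10.1007/3-540-36456-0_24", "type": "CreativeWork"},
      {"id": "sg:pub.10.1007/3-540-36456-0_24", "type": "CreativeWork"}
    ],
    "productId": [
      {"name": "doi", "type": "PropertyValue",
       "value": ["10.1007/978-3-642-03070-3_50"]}
    ],
    "pagination": "663-677"
  }
]
"""


def author_names(record):
    """Return authors as 'Given Family' strings, in record order."""
    return [f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])]


def product_id(record, name):
    """Return the first value of the productId entry with the given name, or None."""
    for entry in record.get("productId", []):
        if entry.get("name") == name:
            values = entry.get("value", [])
            return values[0] if values else None
    return None


def unique_citation_ids(record):
    """Return citation ids with duplicates removed, preserving first-seen order."""
    return list(dict.fromkeys(c["id"] for c in record.get("citation", [])))


# The top level of the document is a list holding a single record.
record = json.loads(RECORD_JSON)[0]
print(author_names(record))
print(product_id(record, "doi"))
print(unique_citation_ids(record))
```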
     

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50'
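
The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The URL and media types are copied from the commands above. The short format keys and function names are illustrative, and it has not been verified here that the endpoint still responds.

```python
import urllib.request

# URL and media types are taken from the curl commands in this record.
RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50"
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build a GET request asking the server for the given RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network; only fetch() does.
req = build_request("turtle")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Keeping request construction separate from the network call makes the header logic easy to check offline.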


     

    This table displays all metadata directly associated to this object as RDF triples.

    116 TRIPLES      23 PREDICATES      41 URIs      20 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/978-3-642-03070-3_50 schema:about anzsrc-for:08
    2 anzsrc-for:0806
    3 schema:author Na1cf362c8c2842308bb2983c14e5746c
    4 schema:citation sg:pub.10.1007/3-540-36456-0_24
    5 sg:pub.10.1007/3-540-45715-1_11
    6 https://doi.org/10.1002/int.20226
    7 https://doi.org/10.1037/0033-295x.84.4.327
    8 https://doi.org/10.1108/eb046814
    9 https://doi.org/10.1109/hicss.2006.129
    10 https://doi.org/10.1109/tkde.2003.1209005
    11 https://doi.org/10.1109/tkde.2004.58
    12 https://doi.org/10.1145/1281192.1281260
    13 https://doi.org/10.1145/318723.318728
    14 https://doi.org/10.1162/coli.2006.32.1.13
    15 https://doi.org/10.3115/1621445.1621458
    16 https://doi.org/10.3115/981732.981751
    17 https://doi.org/10.7551/mitpress/7287.001.0001
    18 schema:datePublished 2009
    19 schema:datePublishedReg 2009-01-01
    20 schema:description Most text clustering techniques are based on words and/or phrases weights in the text. Such representation is often unsatisfactory because it ignores the relationships between terms, and considers them as independent features. In this paper, a new semantic similarity based model (SSBM) is proposed. The semantic similarity based model computes semantic similarities by utilizing WordNet as an ontology. The proposed model captures the semantic similarities between documents that contain semantically similar terms but unnecessarily syntactically identical. The semantic similarity based model assigns a new weight to document terms reflecting the semantic relationships between terms that co-occur literally in the document. Our model in conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm solves ambiguity, synonymy problems that are not detected using traditional term frequency based text mining techniques. The proposed model is evaluated on the Reuters-21578 and the 20-Newsgroups text collections datasets. The performance is assessed in terms of the Fmeasure, Purity and Entropy quality measures. The obtained results show promising performance improvements compared to the traditional term based vector space model (VSM) as well as other existing methods that include semantic similarity measures in text clustering.
    21 schema:editor N92cdbbd7e2eb4314bf1bea9bd60ad232
    22 schema:genre chapter
    23 schema:inLanguage en
    24 schema:isAccessibleForFree false
    25 schema:isPartOf N36b116d195fc48afb9b2ad2600a328aa
    26 schema:name New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps
    27 schema:pagination 663-677
    28 schema:productId N14e2a0b037a540d98df73e2a79ebe643
    29 N71793eac92a64f4cbb85d7ac486eba38
    30 N83699a88452d45509aa138c6a2095c30
    31 schema:publisher N6dc5c5a5cb3d46028781888fe4cdab2c
    32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014462304
    33 https://doi.org/10.1007/978-3-642-03070-3_50
    34 schema:sdDatePublished 2019-04-16T07:18
    35 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    36 schema:sdPublisher Ne70c027fd0f14b189e4bb635adac4470
    37 schema:url https://link.springer.com/10.1007%2F978-3-642-03070-3_50
    38 sgo:license sg:explorer/license/
    39 sgo:sdDataset chapters
    40 rdf:type schema:Chapter
    41 N14e2a0b037a540d98df73e2a79ebe643 schema:name doi
    42 schema:value 10.1007/978-3-642-03070-3_50
    43 rdf:type schema:PropertyValue
    44 N36b116d195fc48afb9b2ad2600a328aa schema:isbn 978-3-642-03069-7
    45 978-3-642-03070-3
    46 schema:name Machine Learning and Data Mining in Pattern Recognition
    47 rdf:type schema:Book
    48 N6dc5c5a5cb3d46028781888fe4cdab2c schema:location Berlin, Heidelberg
    49 schema:name Springer Berlin Heidelberg
    50 rdf:type schema:Organisation
    51 N71793eac92a64f4cbb85d7ac486eba38 schema:name dimensions_id
    52 schema:value pub.1014462304
    53 rdf:type schema:PropertyValue
    54 N83699a88452d45509aa138c6a2095c30 schema:name readcube_id
    55 schema:value 44e64eb4e5bd5b68c32303b27a94ce3a4d3ce3050a427f407f48a35688a5f140
    56 rdf:type schema:PropertyValue
    57 N92cdbbd7e2eb4314bf1bea9bd60ad232 rdf:first Nb158461a685f4e83a9dd85b4709232f3
    58 rdf:rest rdf:nil
    59 Na1cf362c8c2842308bb2983c14e5746c rdf:first sg:person.012222465225.53
    60 rdf:rest Ndce2b543d0f64b618e6511144151e801
    61 Nb158461a685f4e83a9dd85b4709232f3 schema:familyName Perner
    62 schema:givenName Petra
    63 rdf:type schema:Person
    64 Ndce2b543d0f64b618e6511144151e801 rdf:first sg:person.01133760566.26
    65 rdf:rest rdf:nil
    66 Ne70c027fd0f14b189e4bb635adac4470 schema:name Springer Nature - SN SciGraph project
    67 rdf:type schema:Organization
    68 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    69 schema:name Information and Computing Sciences
    70 rdf:type schema:DefinedTerm
    71 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
    72 schema:name Information Systems
    73 rdf:type schema:DefinedTerm
    74 sg:person.01133760566.26 schema:affiliation https://www.grid.ac/institutes/grid.46078.3d
    75 schema:familyName Kamel
    76 schema:givenName Mohamed S.
    77 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01133760566.26
    78 rdf:type schema:Person
    79 sg:person.012222465225.53 schema:affiliation https://www.grid.ac/institutes/grid.46078.3d
    80 schema:familyName Gad
    81 schema:givenName Walaa K.
    82 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012222465225.53
    83 rdf:type schema:Person
    84 sg:pub.10.1007/3-540-36456-0_24 schema:sameAs https://app.dimensions.ai/details/publication/pub.1001746929
    85 https://doi.org/10.1007/3-540-36456-0_24
    86 rdf:type schema:CreativeWork
    87 sg:pub.10.1007/3-540-45715-1_11 schema:sameAs https://app.dimensions.ai/details/publication/pub.1053488638
    88 https://doi.org/10.1007/3-540-45715-1_11
    89 rdf:type schema:CreativeWork
    90 https://doi.org/10.1002/int.20226 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020364411
    91 rdf:type schema:CreativeWork
    92 https://doi.org/10.1037/0033-295x.84.4.327 schema:sameAs https://app.dimensions.ai/details/publication/pub.1010221410
    93 rdf:type schema:CreativeWork
    94 https://doi.org/10.1108/eb046814 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037275209
    95 rdf:type schema:CreativeWork
    96 https://doi.org/10.1109/hicss.2006.129 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093645328
    97 rdf:type schema:CreativeWork
    98 https://doi.org/10.1109/tkde.2003.1209005 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061661172
    99 rdf:type schema:CreativeWork
    100 https://doi.org/10.1109/tkde.2004.58 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061661327
    101 rdf:type schema:CreativeWork
    102 https://doi.org/10.1145/1281192.1281260 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009044209
    103 rdf:type schema:CreativeWork
    104 https://doi.org/10.1145/318723.318728 schema:sameAs https://app.dimensions.ai/details/publication/pub.1036455397
    105 rdf:type schema:CreativeWork
    106 https://doi.org/10.1162/coli.2006.32.1.13 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042269921
    107 rdf:type schema:CreativeWork
    108 https://doi.org/10.3115/1621445.1621458 schema:sameAs https://app.dimensions.ai/details/publication/pub.1099256398
    109 rdf:type schema:CreativeWork
    110 https://doi.org/10.3115/981732.981751 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000312622
    111 rdf:type schema:CreativeWork
    112 https://doi.org/10.7551/mitpress/7287.001.0001 schema:sameAs https://app.dimensions.ai/details/publication/pub.1110625185
    113 rdf:type schema:CreativeWork
    114 https://www.grid.ac/institutes/grid.46078.3d schema:alternateName University of Waterloo
    115 schema:name Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Waterloo, Ontario, Canada
    116 rdf:type schema:Organization
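
Summary counts like those in the header of this table can be derived from an N-Triples dump. The sketch below is a simplified line parser, not a full N-Triples implementation, and it counts distinct subject and object terms, which is an assumed rule that may differ from the one the page uses. The sample lines are hand-written from four rows of the table: the expansion of the `sg:` and `schema:` prefixes and the blank-node label `_:b0` are assumptions.

```python
import re

# Subject is an IRI or blank node, predicate is an IRI, object is anything up to the final dot.
TRIPLE_RE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

# Hand-written sample; prefix expansions and the blank-node label are assumptions.
SAMPLE = """
<http://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50> <http://schema.org/pagination> "663-677" .
<http://scigraph.springernature.com/pub.10.1007/978-3-642-03070-3_50> <http://schema.org/publisher> _:b0 .
_:b0 <http://schema.org/name> "Springer Berlin Heidelberg" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organisation> .
"""


def parse_ntriples(text):
    """Split each non-empty, non-comment line into a (subject, predicate, object) tuple."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE_RE.match(line)
        if match is None:
            raise ValueError(f"not a simple N-Triples line: {line!r}")
        triples.append(match.groups())
    return triples


def summarize(triples):
    """Count triples, distinct predicates, and distinct subject/object terms by kind."""
    predicates = {p for _, p, _ in triples}
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return {
        "triples": len(triples),
        "predicates": len(predicates),
        "uris": sum(1 for n in nodes if n.startswith("<")),
        "literals": sum(1 for n in nodes if n.startswith('"')),
        "blank_nodes": sum(1 for n in nodes if n.startswith("_:")),
    }


summary = summarize(parse_ntriples(SAMPLE))
print(summary)
```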
     



