Similarity Measures for Short Segments of Text


Ontology type: schema:Chapter
Open Access: True


Chapter Info

DATE

2007

AUTHORS

Donald Metzler, Susan Dumais, Christopher Meek

ABSTRACT

Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
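
As a rough illustration of why purely lexical measures struggle on very short texts, the sketch below computes a Jaccard coefficient over whitespace tokens. Jaccard is used here only as a familiar example of a lexical measure; the abstract does not say which lexical measures the chapter evaluates, and the example queries are invented.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard coefficient between the term sets of two short texts."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical query pairs, not taken from the chapter's search log.
overlap = jaccard("cheap flights", "cheap airline tickets")
no_overlap = jaccard("nyc hotels", "new york lodging")
print(overlap, no_overlap)
```

The second pair is clearly related in meaning but shares no terms, so the score is zero. This is the data sparseness problem the abstract refers to.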

PAGES

16-27

Book

TITLE

Advances in Information Retrieval

ISBN

978-3-540-71494-1
978-3-540-71496-5

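Author Affiliations

Donald Metzler: University of Massachusetts, Amherst, MA
Susan Dumais: Microsoft Research, Redmond, WA
Christopher Meek: Microsoft Research, Redmond, WA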

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5

DOI

http://dx.doi.org/10.1007/978-3-540-71496-5_5

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1030086294



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "University of Massachusetts, Amherst, MA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Metzler", 
        "givenName": "Donald", 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, Redmond, WA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Dumais", 
        "givenName": "Susan", 
        "id": "sg:person.014627200551.35", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014627200551.35"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, Redmond, WA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Meek", 
        "givenName": "Christopher", 
        "id": "sg:person.01352023432.48", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1145/1099554.1099695", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1002114215"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/1220575.1220661", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1007502057"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/312624.312681", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012032169"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012153938"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1135777.1135834", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1014998074"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/160688.160718", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017494025"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/383952.384019", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019714596"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/383952.383972", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019923797"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/502585.502654", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1045668804"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1135777.1135835", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046305355"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2007", 
    "datePublishedReg": "2007-01-01", 
    "description": "Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.", 
    "editor": [
      {
        "familyName": "Amati", 
        "givenName": "Giambattista", 
        "type": "Person"
      }, 
      {
        "familyName": "Carpineto", 
        "givenName": "Claudio", 
        "type": "Person"
      }, 
      {
        "familyName": "Romano", 
        "givenName": "Giovanni", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-540-71496-5_5", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": {
      "isbn": [
        "978-3-540-71494-1", 
        "978-3-540-71496-5"
      ], 
      "name": "Advances in Information Retrieval", 
      "type": "Book"
    }, 
    "name": "Similarity Measures for Short Segments of Text", 
    "pagination": "16-27", 
    "productId": [
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-540-71496-5_5"
        ]
      }, 
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "8f5404261a02e4cbbeedb044420f92181f8d100d32dc3015335c37f4f267b435"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1030086294"
        ]
      }
    ], 
    "publisher": {
      "location": "Berlin, Heidelberg", 
      "name": "Springer Berlin Heidelberg", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-540-71496-5_5", 
      "https://app.dimensions.ai/details/publication/pub.1030086294"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2019-04-15T17:14", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8678_00000261.jsonl", 
    "type": "Chapter", 
    "url": "http://link.springer.com/10.1007/978-3-540-71496-5_5"
  }
]
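
A minimal sketch of reading a record like the one above with Python's standard `json` module. The embedded string is a hand-trimmed copy of the record, and the `summarize` helper is a hypothetical name introduced for illustration; it is not part of any SciGraph tooling.

```python
import json

# Trimmed copy of the record above; only the fields used below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Metzler", "givenName": "Donald", "type": "Person"},
      {"familyName": "Dumais", "givenName": "Susan", "type": "Person"},
      {"familyName": "Meek", "givenName": "Christopher", "type": "Person"}
    ],
    "name": "Similarity Measures for Short Segments of Text",
    "pagination": "16-27",
    "productId": [
      {"name": "doi", "type": "PropertyValue",
       "value": ["10.1007/978-3-540-71496-5_5"]},
      {"name": "dimensions_id", "type": "PropertyValue",
       "value": ["pub.1030086294"]}
    ],
    "type": "Chapter"
  }
]
"""


def summarize(record):
    """Return title, author names, and product identifiers from one record."""
    authors = [f"{a['givenName']} {a['familyName']}"
               for a in record.get("author", [])]
    # productId is a list of name/value pairs; each value is itself a list.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    return {"title": record["name"], "authors": authors, "ids": ids}


# The top-level JSON value is a list holding a single record.
summary = summarize(json.loads(RECORD_JSON)[0])
print(summary["title"])
print(", ".join(summary["authors"]))
print(summary["ids"]["doi"])
```

This treats the record as plain JSON and ignores the `@context`; a JSON-LD processor would be needed to resolve the keys to full IRIs.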
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'
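
The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard `urllib.request` module. The format labels (`json-ld`, `n-triples`, `turtle`, `rdf-xml`) and the helper names are assumptions made for this example, and `fetch` assumes the endpoint still serves these media types.

```python
import urllib.request

# Accept headers copied from the curl commands above; the dictionary keys
# are hypothetical short labels chosen for this sketch.
ACCEPT = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}

BASE = "https://scigraph.springernature.com/"


def build_request(record_id, fmt):
    """Build (but do not send) a GET request for one record in one format."""
    return urllib.request.Request(BASE + record_id,
                                  headers={"Accept": ACCEPT[fmt]})


def fetch(record_id, fmt, timeout=10):
    """Send the request and return the response body as text (needs network)."""
    request = build_request(record_id, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


req = build_request("pub.10.1007/978-3-540-71496-5_5", "turtle")
print(req.full_url, req.get_header("Accept"))
```

Calling `fetch("pub.10.1007/978-3-540-71496-5_5", "json-ld")` would issue the same request as the first curl command.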


 

This table displays all metadata directly associated with this object as RDF triples.

120 TRIPLES      23 PREDICATES      37 URIs      20 LITERALS      8 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/978-3-540-71496-5_5 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author N6aab1ebaaece48768385c211c8d69b36
4 schema:citation https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9
5 https://doi.org/10.1145/1099554.1099695
6 https://doi.org/10.1145/1135777.1135834
7 https://doi.org/10.1145/1135777.1135835
8 https://doi.org/10.1145/160688.160718
9 https://doi.org/10.1145/312624.312681
10 https://doi.org/10.1145/383952.383972
11 https://doi.org/10.1145/383952.384019
12 https://doi.org/10.1145/502585.502654
13 https://doi.org/10.3115/1220575.1220661
14 schema:datePublished 2007
15 schema:datePublishedReg 2007-01-01
16 schema:description Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
17 schema:editor N95ca88f0a6934cc087bcb1f4eee47efe
18 schema:genre chapter
19 schema:inLanguage en
20 schema:isAccessibleForFree true
21 schema:isPartOf Na9267f894fcf43988350dd697ae44171
22 schema:name Similarity Measures for Short Segments of Text
23 schema:pagination 16-27
24 schema:productId N0eabc79fe0d94346bac2df8d021af314
25 N1221b29651d0405b971c6a66951a18f3
26 N4a1da639105e4af1ae561261dc36a18c
27 schema:publisher Na2584cffbdd840ffbf040b10e94a7da7
28 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030086294
29 https://doi.org/10.1007/978-3-540-71496-5_5
30 schema:sdDatePublished 2019-04-15T17:14
31 schema:sdLicense https://scigraph.springernature.com/explorer/license/
32 schema:sdPublisher Nbb84065c11544f049063d25298fdaff4
33 schema:url http://link.springer.com/10.1007/978-3-540-71496-5_5
34 sgo:license sg:explorer/license/
35 sgo:sdDataset chapters
36 rdf:type schema:Chapter
37 N00f1d6f313ed495ba75af2d37a8dfcc3 schema:name University of Massachusetts, Amherst, MA,
38 rdf:type schema:Organization
39 N0eabc79fe0d94346bac2df8d021af314 schema:name dimensions_id
40 schema:value pub.1030086294
41 rdf:type schema:PropertyValue
42 N1221b29651d0405b971c6a66951a18f3 schema:name doi
43 schema:value 10.1007/978-3-540-71496-5_5
44 rdf:type schema:PropertyValue
45 N21354dcaf355481296933c9d76962b95 schema:familyName Romano
46 schema:givenName Giovanni
47 rdf:type schema:Person
48 N340e2640bc0c49439f7f8deeb7f01d99 schema:affiliation N00f1d6f313ed495ba75af2d37a8dfcc3
49 schema:familyName Metzler
50 schema:givenName Donald
51 rdf:type schema:Person
52 N4a1da639105e4af1ae561261dc36a18c schema:name readcube_id
53 schema:value 8f5404261a02e4cbbeedb044420f92181f8d100d32dc3015335c37f4f267b435
54 rdf:type schema:PropertyValue
55 N6aab1ebaaece48768385c211c8d69b36 rdf:first N340e2640bc0c49439f7f8deeb7f01d99
56 rdf:rest N9bb0717d97b949989aae2e8017582208
57 N95ca88f0a6934cc087bcb1f4eee47efe rdf:first N96de804c24e64bdda29a2a2f81cac5f8
58 rdf:rest Ndbf8c09ae6d141d1b89d056cf4f6f5ff
59 N96de804c24e64bdda29a2a2f81cac5f8 schema:familyName Amati
60 schema:givenName Giambattista
61 rdf:type schema:Person
62 N9bb0717d97b949989aae2e8017582208 rdf:first sg:person.014627200551.35
63 rdf:rest Nb726a756b91f4fb0a2923f30253b57d3
64 Na2584cffbdd840ffbf040b10e94a7da7 schema:location Berlin, Heidelberg
65 schema:name Springer Berlin Heidelberg
66 rdf:type schema:Organisation
67 Na389cb0af01a48279f59d876e8fed20d rdf:first N21354dcaf355481296933c9d76962b95
68 rdf:rest rdf:nil
69 Na9267f894fcf43988350dd697ae44171 schema:isbn 978-3-540-71494-1
70 978-3-540-71496-5
71 schema:name Advances in Information Retrieval
72 rdf:type schema:Book
73 Nb726a756b91f4fb0a2923f30253b57d3 rdf:first sg:person.01352023432.48
74 rdf:rest rdf:nil
75 Nbb84065c11544f049063d25298fdaff4 schema:name Springer Nature - SN SciGraph project
76 rdf:type schema:Organization
77 Nc01e59942f474d5ea99338d3653f5991 schema:familyName Carpineto
78 schema:givenName Claudio
79 rdf:type schema:Person
80 Ndbf8c09ae6d141d1b89d056cf4f6f5ff rdf:first Nc01e59942f474d5ea99338d3653f5991
81 rdf:rest Na389cb0af01a48279f59d876e8fed20d
82 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
83 schema:name Information and Computing Sciences
84 rdf:type schema:DefinedTerm
85 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
86 schema:name Artificial Intelligence and Image Processing
87 rdf:type schema:DefinedTerm
88 sg:person.01352023432.48 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
89 schema:familyName Meek
90 schema:givenName Christopher
91 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48
92 rdf:type schema:Person
93 sg:person.014627200551.35 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
94 schema:familyName Dumais
95 schema:givenName Susan
96 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014627200551.35
97 rdf:type schema:Person
98 https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012153938
99 rdf:type schema:CreativeWork
100 https://doi.org/10.1145/1099554.1099695 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002114215
101 rdf:type schema:CreativeWork
102 https://doi.org/10.1145/1135777.1135834 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014998074
103 rdf:type schema:CreativeWork
104 https://doi.org/10.1145/1135777.1135835 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046305355
105 rdf:type schema:CreativeWork
106 https://doi.org/10.1145/160688.160718 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017494025
107 rdf:type schema:CreativeWork
108 https://doi.org/10.1145/312624.312681 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012032169
109 rdf:type schema:CreativeWork
110 https://doi.org/10.1145/383952.383972 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019923797
111 rdf:type schema:CreativeWork
112 https://doi.org/10.1145/383952.384019 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019714596
113 rdf:type schema:CreativeWork
114 https://doi.org/10.1145/502585.502654 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045668804
115 rdf:type schema:CreativeWork
116 https://doi.org/10.3115/1220575.1220661 schema:sameAs https://app.dimensions.ai/details/publication/pub.1007502057
117 rdf:type schema:CreativeWork
118 https://www.grid.ac/institutes/grid.419815.0 schema:alternateName Microsoft (United States)
119 schema:name Microsoft Research, Redmond, WA,
120 rdf:type schema:Organization
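
In the table above, a row that omits the subject or predicate appears to inherit it from the nearest row above. The sketch below expands a few rows into full triples under that reading. The rows were split into cells by hand, and `expand` is a hypothetical helper written for this example.

```python
# Rows 1-3 and 14 of the table, split by hand into (subject, predicate, object).
# None marks a cell the table leaves blank.
rows = [
    ("sg:pub.10.1007/978-3-540-71496-5_5", "schema:about", "anzsrc-for:08"),
    (None, None, "anzsrc-for:0801"),
    (None, "schema:author", "N6aab1ebaaece48768385c211c8d69b36"),
    (None, "schema:datePublished", "2007"),
]


def expand(rows):
    """Fill blank subject/predicate cells from the nearest row above."""
    subject = predicate = None
    triples = []
    for s, p, o in rows:
        subject = s or subject
        predicate = p or predicate
        triples.append((subject, predicate, o))
    return triples


triples = expand(rows)
for triple in triples:
    print(" ".join(triple))
```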
 



