Similarity Measures for Short Segments of Text


Ontology type: schema:Chapter      Open Access: True


Chapter Info

DATE

2007

AUTHORS

Donald Metzler , Susan Dumais , Christopher Meek

ABSTRACT

Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.

PAGES

16-27

Book

TITLE

Advances in Information Retrieval

ISBN

978-3-540-71494-1
978-3-540-71496-5

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5

DOI

http://dx.doi.org/10.1007/978-3-540-71496-5_5

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1030086294



JSON-LD is the canonical representation for SciGraph data.

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "University of Massachusetts, Amherst, MA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Metzler", 
        "givenName": "Donald", 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, Redmond, WA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Dumais", 
        "givenName": "Susan", 
        "id": "sg:person.014627200551.35", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014627200551.35"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, Redmond, WA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Meek", 
        "givenName": "Christopher", 
        "id": "sg:person.01352023432.48", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1145/1099554.1099695", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1002114215"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/1220575.1220661", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1007502057"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/312624.312681", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012032169"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012153938"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1135777.1135834", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1014998074"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/160688.160718", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017494025"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/383952.384019", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019714596"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/383952.383972", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019923797"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/502585.502654", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1045668804"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1135777.1135835", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046305355"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2007", 
    "datePublishedReg": "2007-01-01", 
    "description": "Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.", 
    "editor": [
      {
        "familyName": "Amati", 
        "givenName": "Giambattista", 
        "type": "Person"
      }, 
      {
        "familyName": "Carpineto", 
        "givenName": "Claudio", 
        "type": "Person"
      }, 
      {
        "familyName": "Romano", 
        "givenName": "Giovanni", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-540-71496-5_5", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": {
      "isbn": [
        "978-3-540-71494-1", 
        "978-3-540-71496-5"
      ], 
      "name": "Advances in Information Retrieval", 
      "type": "Book"
    }, 
    "name": "Similarity Measures for Short Segments of Text", 
    "pagination": "16-27", 
    "productId": [
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-540-71496-5_5"
        ]
      }, 
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "8f5404261a02e4cbbeedb044420f92181f8d100d32dc3015335c37f4f267b435"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1030086294"
        ]
      }
    ], 
    "publisher": {
      "location": "Berlin, Heidelberg", 
      "name": "Springer Berlin Heidelberg", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-540-71496-5_5", 
      "https://app.dimensions.ai/details/publication/pub.1030086294"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2019-04-15T17:14", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8678_00000261.jsonl", 
    "type": "Chapter", 
    "url": "http://link.springer.com/10.1007/978-3-540-71496-5_5"
  }
]
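
Because JSON-LD is plain JSON, the record above can be read with any JSON parser. The following is a minimal sketch, assuming only the fields shown in the record; the excerpt is trimmed to the keys the sketch uses, and the `summarize` helper and its output keys are illustrative names, not part of SciGraph.

```python
import json

# Trimmed excerpt of the record above; only the fields used below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Metzler", "givenName": "Donald", "type": "Person"},
      {"familyName": "Dumais", "givenName": "Susan", "type": "Person"},
      {"familyName": "Meek", "givenName": "Christopher", "type": "Person"}
    ],
    "name": "Similarity Measures for Short Segments of Text",
    "pagination": "16-27",
    "productId": [
      {"name": "doi", "type": "PropertyValue",
       "value": ["10.1007/978-3-540-71496-5_5"]},
      {"name": "dimensions_id", "type": "PropertyValue",
       "value": ["pub.1030086294"]}
    ]
  }
]
"""


def summarize(record_json):
    """Return title, author names, DOI, and pages from a one-record JSON-LD array."""
    record = json.loads(record_json)[0]
    # Authors appear in publication order in the "author" array.
    authors = [f"{a['givenName']} {a['familyName']}" for a in record["author"]]
    # Each productId entry holds a name and a one-element "value" list.
    ids = {p["name"]: p["value"][0] for p in record["productId"]}
    return {
        "title": record["name"],
        "authors": authors,
        "doi": ids.get("doi"),
        "pages": record["pagination"],
    }


summary = summarize(RECORD_JSON)
print(summary)
```

Reading `productId` into a dictionary keyed by `name` avoids relying on the position of the DOI entry in the list.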
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'
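
The four curl commands above differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. This sketch only builds the request object and does not send it; the `ACCEPT_TYPES` keys and the `build_request` helper are illustrative names, and whether the endpoint still responds is an assumption this page does not confirm.

```python
from urllib.request import Request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5"

# Media types taken from the curl commands above; the short keys are made up here.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build (but do not send) a GET request asking for the given RDF format."""
    return Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


req = build_request("turtle")
print(req.full_url, req.get_header("Accept"))

# Sending the request needs network access, for example:
#   from urllib.request import urlopen
#   body = urlopen(req).read().decode("utf-8")
```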


 

This table displays all metadata directly associated with this object as RDF triples.

120 TRIPLES      23 PREDICATES      37 URIs      20 LITERALS      8 BLANK NODES

Row Subject Predicate Object (a row that omits the subject, or the subject and predicate, repeats them from the row above)
1 sg:pub.10.1007/978-3-540-71496-5_5 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author Nb067935de031476189c4183e58e56507
4 schema:citation https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9
5 https://doi.org/10.1145/1099554.1099695
6 https://doi.org/10.1145/1135777.1135834
7 https://doi.org/10.1145/1135777.1135835
8 https://doi.org/10.1145/160688.160718
9 https://doi.org/10.1145/312624.312681
10 https://doi.org/10.1145/383952.383972
11 https://doi.org/10.1145/383952.384019
12 https://doi.org/10.1145/502585.502654
13 https://doi.org/10.3115/1220575.1220661
14 schema:datePublished 2007
15 schema:datePublishedReg 2007-01-01
16 schema:description Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
17 schema:editor N8a5e7a88c5394cd48cf09423ed574ca4
18 schema:genre chapter
19 schema:inLanguage en
20 schema:isAccessibleForFree true
21 schema:isPartOf N434c6a5ad2ce4f6f8327359122b0bc6f
22 schema:name Similarity Measures for Short Segments of Text
23 schema:pagination 16-27
24 schema:productId N45032e3731eb481fb7e84082d3f7bb84
25 N91186177a34d49edb1304a9adc6013bc
26 Ne192e39233f144fe8fc36019706761b8
27 schema:publisher N7c71a8c8f2d14ea5828fb4b923e84624
28 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030086294
29 https://doi.org/10.1007/978-3-540-71496-5_5
30 schema:sdDatePublished 2019-04-15T17:14
31 schema:sdLicense https://scigraph.springernature.com/explorer/license/
32 schema:sdPublisher N87c3cb31a8bd4fba9e7d6a4543f94b51
33 schema:url http://link.springer.com/10.1007/978-3-540-71496-5_5
34 sgo:license sg:explorer/license/
35 sgo:sdDataset chapters
36 rdf:type schema:Chapter
37 N434c6a5ad2ce4f6f8327359122b0bc6f schema:isbn 978-3-540-71494-1
38 978-3-540-71496-5
39 schema:name Advances in Information Retrieval
40 rdf:type schema:Book
41 N45032e3731eb481fb7e84082d3f7bb84 schema:name doi
42 schema:value 10.1007/978-3-540-71496-5_5
43 rdf:type schema:PropertyValue
44 N56b68d4bd0814c56940f28b4e1d29548 schema:familyName Carpineto
45 schema:givenName Claudio
46 rdf:type schema:Person
47 N651a68af63384f22bbeead99d590ee8f rdf:first sg:person.014627200551.35
48 rdf:rest N98d0ff0122704b2abd7ae1128017a2d5
49 N7c71a8c8f2d14ea5828fb4b923e84624 schema:location Berlin, Heidelberg
50 schema:name Springer Berlin Heidelberg
51 rdf:type schema:Organisation
52 N87c3cb31a8bd4fba9e7d6a4543f94b51 schema:name Springer Nature - SN SciGraph project
53 rdf:type schema:Organization
54 N8a5e7a88c5394cd48cf09423ed574ca4 rdf:first N90a7799d4e314bba8588993aa25d473d
55 rdf:rest Nb08da2b2b1054a3bbf163882f9342a8c
56 N90a7799d4e314bba8588993aa25d473d schema:familyName Amati
57 schema:givenName Giambattista
58 rdf:type schema:Person
59 N91186177a34d49edb1304a9adc6013bc schema:name readcube_id
60 schema:value 8f5404261a02e4cbbeedb044420f92181f8d100d32dc3015335c37f4f267b435
61 rdf:type schema:PropertyValue
62 N98d0ff0122704b2abd7ae1128017a2d5 rdf:first sg:person.01352023432.48
63 rdf:rest rdf:nil
64 Na159cb244673409cb397b0f29fd5a6b7 rdf:first Nac681e334daf40efaf9d5c77d0ec4f38
65 rdf:rest rdf:nil
66 Nac681e334daf40efaf9d5c77d0ec4f38 schema:familyName Romano
67 schema:givenName Giovanni
68 rdf:type schema:Person
69 Nb067935de031476189c4183e58e56507 rdf:first Nda68af6bbbc148cd8144e6982a07fd7d
70 rdf:rest N651a68af63384f22bbeead99d590ee8f
71 Nb08da2b2b1054a3bbf163882f9342a8c rdf:first N56b68d4bd0814c56940f28b4e1d29548
72 rdf:rest Na159cb244673409cb397b0f29fd5a6b7
73 Nda68af6bbbc148cd8144e6982a07fd7d schema:affiliation Nf0517749b0074b83b3fef977cdafa142
74 schema:familyName Metzler
75 schema:givenName Donald
76 rdf:type schema:Person
77 Ne192e39233f144fe8fc36019706761b8 schema:name dimensions_id
78 schema:value pub.1030086294
79 rdf:type schema:PropertyValue
80 Nf0517749b0074b83b3fef977cdafa142 schema:name University of Massachusetts, Amherst, MA,
81 rdf:type schema:Organization
82 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
83 schema:name Information and Computing Sciences
84 rdf:type schema:DefinedTerm
85 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
86 schema:name Artificial Intelligence and Image Processing
87 rdf:type schema:DefinedTerm
88 sg:person.01352023432.48 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
89 schema:familyName Meek
90 schema:givenName Christopher
91 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48
92 rdf:type schema:Person
93 sg:person.014627200551.35 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
94 schema:familyName Dumais
95 schema:givenName Susan
96 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014627200551.35
97 rdf:type schema:Person
98 https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012153938
99 rdf:type schema:CreativeWork
100 https://doi.org/10.1145/1099554.1099695 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002114215
101 rdf:type schema:CreativeWork
102 https://doi.org/10.1145/1135777.1135834 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014998074
103 rdf:type schema:CreativeWork
104 https://doi.org/10.1145/1135777.1135835 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046305355
105 rdf:type schema:CreativeWork
106 https://doi.org/10.1145/160688.160718 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017494025
107 rdf:type schema:CreativeWork
108 https://doi.org/10.1145/312624.312681 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012032169
109 rdf:type schema:CreativeWork
110 https://doi.org/10.1145/383952.383972 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019923797
111 rdf:type schema:CreativeWork
112 https://doi.org/10.1145/383952.384019 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019714596
113 rdf:type schema:CreativeWork
114 https://doi.org/10.1145/502585.502654 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045668804
115 rdf:type schema:CreativeWork
116 https://doi.org/10.3115/1220575.1220661 schema:sameAs https://app.dimensions.ai/details/publication/pub.1007502057
117 rdf:type schema:CreativeWork
118 https://www.grid.ac/institutes/grid.419815.0 schema:alternateName Microsoft (United States)
119 schema:name Microsoft Research, Redmond, WA,
120 rdf:type schema:Organization
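
In the table above, the author order is stored as an RDF list: `schema:author` points at a blank node, and each list node carries an `rdf:first` item and an `rdf:rest` link until `rdf:nil`. The following is a minimal sketch of walking that chain, with the node identifiers copied from rows 3, 47-48, 62-63, and 69-70 and the family names from rows 74, 89, and 94. The dictionary layout and the `walk_rdf_list` helper are illustrative; a real RDF library would normally do this for you.

```python
RDF_NIL = "rdf:nil"

# list node -> (rdf:first, rdf:rest), copied from the triples table.
LIST_NODES = {
    "Nb067935de031476189c4183e58e56507": (
        "Nda68af6bbbc148cd8144e6982a07fd7d",
        "N651a68af63384f22bbeead99d590ee8f",
    ),
    "N651a68af63384f22bbeead99d590ee8f": (
        "sg:person.014627200551.35",
        "N98d0ff0122704b2abd7ae1128017a2d5",
    ),
    "N98d0ff0122704b2abd7ae1128017a2d5": (
        "sg:person.01352023432.48",
        RDF_NIL,
    ),
}

# schema:familyName for each person node in the table.
FAMILY_NAMES = {
    "Nda68af6bbbc148cd8144e6982a07fd7d": "Metzler",
    "sg:person.014627200551.35": "Dumais",
    "sg:person.01352023432.48": "Meek",
}


def walk_rdf_list(head, nodes):
    """Follow rdf:rest links from head, collecting each rdf:first item in order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError(f"cycle in RDF list at {node}")
        seen.add(node)
        first, rest = nodes[node]
        items.append(first)
        node = rest
    return items


# Row 3 gives the head of the author list.
author_nodes = walk_rdf_list("Nb067935de031476189c4183e58e56507", LIST_NODES)
author_order = [FAMILY_NAMES[n] for n in author_nodes]
print(author_order)
```

The recovered order matches the `author` array in the JSON-LD record, which is the reason the list structure is used: plain RDF triples are unordered.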
 



