Similarity Measures for Short Segments of Text


Ontology type: schema:Chapter      Open Access: True


Chapter Info

DATE

2007

AUTHORS

Donald Metzler, Susan Dumais, Christopher Meek

ABSTRACT

Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.

PAGES

16-27

Book

TITLE

Advances in Information Retrieval

ISBN

978-3-540-71494-1
978-3-540-71496-5

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5

DOI

http://dx.doi.org/10.1007/978-3-540-71496-5_5

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1030086294



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "University of Massachusetts, Amherst, MA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Metzler", 
        "givenName": "Donald", 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, Redmond, WA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Dumais", 
        "givenName": "Susan", 
        "id": "sg:person.014627200551.35", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014627200551.35"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, Redmond, WA,"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Meek", 
        "givenName": "Christopher", 
        "id": "sg:person.01352023432.48", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1145/1099554.1099695", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1002114215"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/1220575.1220661", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1007502057"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/312624.312681", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012032169"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012153938"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1135777.1135834", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1014998074"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/160688.160718", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017494025"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/383952.384019", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019714596"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/383952.383972", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019923797"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/502585.502654", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1045668804"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/1135777.1135835", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046305355"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2007", 
    "datePublishedReg": "2007-01-01", 
    "description": "Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.", 
    "editor": [
      {
        "familyName": "Amati", 
        "givenName": "Giambattista", 
        "type": "Person"
      }, 
      {
        "familyName": "Carpineto", 
        "givenName": "Claudio", 
        "type": "Person"
      }, 
      {
        "familyName": "Romano", 
        "givenName": "Giovanni", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-540-71496-5_5", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": {
      "isbn": [
        "978-3-540-71494-1", 
        "978-3-540-71496-5"
      ], 
      "name": "Advances in Information Retrieval", 
      "type": "Book"
    }, 
    "name": "Similarity Measures for Short Segments of Text", 
    "pagination": "16-27", 
    "productId": [
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-540-71496-5_5"
        ]
      }, 
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "8f5404261a02e4cbbeedb044420f92181f8d100d32dc3015335c37f4f267b435"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1030086294"
        ]
      }
    ], 
    "publisher": {
      "location": "Berlin, Heidelberg", 
      "name": "Springer Berlin Heidelberg", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-540-71496-5_5", 
      "https://app.dimensions.ai/details/publication/pub.1030086294"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2019-04-15T17:14", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8678_00000261.jsonl", 
    "type": "Chapter", 
    "url": "http://link.springer.com/10.1007/978-3-540-71496-5_5"
  }
]
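As a rough illustration of how a record shaped like the one above might be consumed, the following Python sketch pulls out the title, author names, DOI, and cited DOIs. The function name `summarize` and the trimmed `RECORD_JSON` sample are assumptions made for this example; only the field names (`author`, `citation`, `productId`, `name`) come from the record itself.

```python
import json

# Trimmed copy of the record above; only the fields this sketch reads are kept,
# and the citation list is shortened to two entries.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Metzler", "givenName": "Donald", "type": "Person"},
      {"familyName": "Dumais", "givenName": "Susan", "type": "Person"},
      {"familyName": "Meek", "givenName": "Christopher", "type": "Person"}
    ],
    "citation": [
      {"id": "https://doi.org/10.1145/1099554.1099695", "type": "CreativeWork"},
      {"id": "https://doi.org/10.3115/1220575.1220661", "type": "CreativeWork"}
    ],
    "name": "Similarity Measures for Short Segments of Text",
    "productId": [
      {"name": "doi", "type": "PropertyValue",
       "value": ["10.1007/978-3-540-71496-5_5"]}
    ]
  }
]
"""

DOI_PREFIX = "https://doi.org/"


def summarize(record_json):
    """Hypothetical helper: extract a few fields from a SciGraph-style record."""
    record = json.loads(record_json)[0]  # the payload is a one-element list

    authors = [
        f"{person['givenName']} {person['familyName']}"
        for person in record.get("author", [])
    ]

    # productId is a list of name/value pairs; pick the entry named "doi".
    doi = next(
        (pid["value"][0] for pid in record.get("productId", [])
         if pid.get("name") == "doi"),
        None,
    )

    # Citations are identified by DOI URLs; strip the resolver prefix.
    cited = [
        cite["id"][len(DOI_PREFIX):]
        for cite in record.get("citation", [])
        if cite.get("id", "").startswith(DOI_PREFIX)
    ]

    return {"title": record["name"], "authors": authors,
            "doi": doi, "cited_dois": cited}


summary = summarize(RECORD_JSON)
print(summary["title"])
print(", ".join(summary["authors"]))
```

The sketch treats the payload as plain JSON, which is enough for simple field lookups. Code that needs to resolve the `@context` would call for a JSON-LD processor instead.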
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular linked data format that is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5'
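The four curl commands above differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The names `ACCEPT_BY_FORMAT`, `build_request`, and `fetch` are assumptions introduced for this example. The URL and media types are the ones shown in the commands, and whether the service still answers at that address is not something this sketch can guarantee.

```python
import urllib.request

# URL and media types copied from the curl commands above.
RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/978-3-540-71496-5_5"

ACCEPT_BY_FORMAT = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Hypothetical helper: build a GET request asking for one RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_BY_FORMAT[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Hypothetical helper: send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (needs network access, so it is left commented out):
# print(fetch("turtle")[:500])
```

Keeping request construction separate from the network call makes it possible to inspect the headers without contacting the server.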


 

This table displays all metadata directly associated with this object as RDF triples.

120 TRIPLES      23 PREDICATES      37 URIs      20 LITERALS      8 BLANK NODES

Row Subject Predicate Object
1 sg:pub.10.1007/978-3-540-71496-5_5 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author Nd3dadbc8a55f40afa7e4dd4568e7bce9
4 schema:citation https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9
5 https://doi.org/10.1145/1099554.1099695
6 https://doi.org/10.1145/1135777.1135834
7 https://doi.org/10.1145/1135777.1135835
8 https://doi.org/10.1145/160688.160718
9 https://doi.org/10.1145/312624.312681
10 https://doi.org/10.1145/383952.383972
11 https://doi.org/10.1145/383952.384019
12 https://doi.org/10.1145/502585.502654
13 https://doi.org/10.3115/1220575.1220661
14 schema:datePublished 2007
15 schema:datePublishedReg 2007-01-01
16 schema:description Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
17 schema:editor N13130548a71e4cc0b6b1a9c5c6662c3b
18 schema:genre chapter
19 schema:inLanguage en
20 schema:isAccessibleForFree true
21 schema:isPartOf N8caa8160f08446c7919583c9e92ef3bd
22 schema:name Similarity Measures for Short Segments of Text
23 schema:pagination 16-27
24 schema:productId N1c8191c7a59b4d419b4b1aa15b481c20
25 N6718572ae7494cbeb804c62d3ad9fbb9
26 Ne2fd9feceb234b3d8a75a229daa0b6a6
27 schema:publisher N35112ea976624294a15c72e8700e8d69
28 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030086294
29 https://doi.org/10.1007/978-3-540-71496-5_5
30 schema:sdDatePublished 2019-04-15T17:14
31 schema:sdLicense https://scigraph.springernature.com/explorer/license/
32 schema:sdPublisher N89a2136a5c544a4eafd2dbb230538b1f
33 schema:url http://link.springer.com/10.1007/978-3-540-71496-5_5
34 sgo:license sg:explorer/license/
35 sgo:sdDataset chapters
36 rdf:type schema:Chapter
37 N107a9369ccf543aca62e85be1ae18379 schema:familyName Amati
38 schema:givenName Giambattista
39 rdf:type schema:Person
40 N13130548a71e4cc0b6b1a9c5c6662c3b rdf:first N107a9369ccf543aca62e85be1ae18379
41 rdf:rest N6e5f7113c2fc4f86a588fcc796d3c819
42 N1c8191c7a59b4d419b4b1aa15b481c20 schema:name readcube_id
43 schema:value 8f5404261a02e4cbbeedb044420f92181f8d100d32dc3015335c37f4f267b435
44 rdf:type schema:PropertyValue
45 N2983c68dc3c54380972fa21c1016d499 rdf:first sg:person.014627200551.35
46 rdf:rest N8d4509959a1442e2a69d60e52b343022
47 N35112ea976624294a15c72e8700e8d69 schema:location Berlin, Heidelberg
48 schema:name Springer Berlin Heidelberg
49 rdf:type schema:Organisation
50 N3d2707e75ab343b6afb25bbc19e46cfd schema:affiliation Nfdf527e7eb0a46709adc0a9b81d022f5
51 schema:familyName Metzler
52 schema:givenName Donald
53 rdf:type schema:Person
54 N61ef55ad775a47618427a3700f9c138f schema:familyName Romano
55 schema:givenName Giovanni
56 rdf:type schema:Person
57 N6718572ae7494cbeb804c62d3ad9fbb9 schema:name dimensions_id
58 schema:value pub.1030086294
59 rdf:type schema:PropertyValue
60 N6e5f7113c2fc4f86a588fcc796d3c819 rdf:first Nb0553d574ad74909bb1e2fcf0c230679
61 rdf:rest Nbffa067effa14668996dc6ebb323ea99
62 N89a2136a5c544a4eafd2dbb230538b1f schema:name Springer Nature - SN SciGraph project
63 rdf:type schema:Organization
64 N8caa8160f08446c7919583c9e92ef3bd schema:isbn 978-3-540-71494-1
65 978-3-540-71496-5
66 schema:name Advances in Information Retrieval
67 rdf:type schema:Book
68 N8d4509959a1442e2a69d60e52b343022 rdf:first sg:person.01352023432.48
69 rdf:rest rdf:nil
70 Nb0553d574ad74909bb1e2fcf0c230679 schema:familyName Carpineto
71 schema:givenName Claudio
72 rdf:type schema:Person
73 Nbffa067effa14668996dc6ebb323ea99 rdf:first N61ef55ad775a47618427a3700f9c138f
74 rdf:rest rdf:nil
75 Nd3dadbc8a55f40afa7e4dd4568e7bce9 rdf:first N3d2707e75ab343b6afb25bbc19e46cfd
76 rdf:rest N2983c68dc3c54380972fa21c1016d499
77 Ne2fd9feceb234b3d8a75a229daa0b6a6 schema:name doi
78 schema:value 10.1007/978-3-540-71496-5_5
79 rdf:type schema:PropertyValue
80 Nfdf527e7eb0a46709adc0a9b81d022f5 schema:name University of Massachusetts, Amherst, MA,
81 rdf:type schema:Organization
82 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
83 schema:name Information and Computing Sciences
84 rdf:type schema:DefinedTerm
85 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
86 schema:name Artificial Intelligence and Image Processing
87 rdf:type schema:DefinedTerm
88 sg:person.01352023432.48 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
89 schema:familyName Meek
90 schema:givenName Christopher
91 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48
92 rdf:type schema:Person
93 sg:person.014627200551.35 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
94 schema:familyName Dumais
95 schema:givenName Susan
96 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014627200551.35
97 rdf:type schema:Person
98 https://doi.org/10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012153938
99 rdf:type schema:CreativeWork
100 https://doi.org/10.1145/1099554.1099695 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002114215
101 rdf:type schema:CreativeWork
102 https://doi.org/10.1145/1135777.1135834 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014998074
103 rdf:type schema:CreativeWork
104 https://doi.org/10.1145/1135777.1135835 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046305355
105 rdf:type schema:CreativeWork
106 https://doi.org/10.1145/160688.160718 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017494025
107 rdf:type schema:CreativeWork
108 https://doi.org/10.1145/312624.312681 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012032169
109 rdf:type schema:CreativeWork
110 https://doi.org/10.1145/383952.383972 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019923797
111 rdf:type schema:CreativeWork
112 https://doi.org/10.1145/383952.384019 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019714596
113 rdf:type schema:CreativeWork
114 https://doi.org/10.1145/502585.502654 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045668804
115 rdf:type schema:CreativeWork
116 https://doi.org/10.3115/1220575.1220661 schema:sameAs https://app.dimensions.ai/details/publication/pub.1007502057
117 rdf:type schema:CreativeWork
118 https://www.grid.ac/institutes/grid.419815.0 schema:alternateName Microsoft (United States)
119 schema:name Microsoft Research, Redmond, WA,
120 rdf:type schema:Organization
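In the table above, the ordered editor list is encoded as an RDF collection: each blank node carries an `rdf:first` value (the item) and an `rdf:rest` value (the next node), ending at `rdf:nil`. The following Python sketch walks that chain to recover the editors in order. The `TRIPLES` dictionary and the `walk_list` helper are assumptions made for this example; the node identifiers and family names are copied from rows 37, 40-41, 54, 60-61, 70, and 73-74.

```python
RDF_NIL = "rdf:nil"
EDITOR_LIST_HEAD = "N13130548a71e4cc0b6b1a9c5c6662c3b"  # object of schema:editor (row 17)

# (subject, predicate) -> object, copied from the triples table above.
TRIPLES = {
    ("N13130548a71e4cc0b6b1a9c5c6662c3b", "rdf:first"): "N107a9369ccf543aca62e85be1ae18379",
    ("N13130548a71e4cc0b6b1a9c5c6662c3b", "rdf:rest"): "N6e5f7113c2fc4f86a588fcc796d3c819",
    ("N6e5f7113c2fc4f86a588fcc796d3c819", "rdf:first"): "Nb0553d574ad74909bb1e2fcf0c230679",
    ("N6e5f7113c2fc4f86a588fcc796d3c819", "rdf:rest"): "Nbffa067effa14668996dc6ebb323ea99",
    ("Nbffa067effa14668996dc6ebb323ea99", "rdf:first"): "N61ef55ad775a47618427a3700f9c138f",
    ("Nbffa067effa14668996dc6ebb323ea99", "rdf:rest"): RDF_NIL,
    ("N107a9369ccf543aca62e85be1ae18379", "schema:familyName"): "Amati",
    ("Nb0553d574ad74909bb1e2fcf0c230679", "schema:familyName"): "Carpineto",
    ("N61ef55ad775a47618427a3700f9c138f", "schema:familyName"): "Romano",
}


def walk_list(head, triples):
    """Hypothetical helper: follow rdf:first / rdf:rest links until rdf:nil."""
    items = []
    visited = set()
    node = head
    while node != RDF_NIL:
        if node in visited:  # guard against a malformed, cyclic list
            raise ValueError(f"cycle detected at {node}")
        visited.add(node)
        items.append(triples[(node, "rdf:first")])
        node = triples[(node, "rdf:rest")]
    return items


editor_nodes = walk_list(EDITOR_LIST_HEAD, TRIPLES)
editors = [TRIPLES[(node, "schema:familyName")] for node in editor_nodes]
print(editors)
```

The same walk applies to the author list, which starts at the object of `schema:author` in row 3.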
 



