Efficient Update of Indexes for Dynamically Changing Web Documents


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2007-03

AUTHORS

Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ramesh Agarwal

ABSTRACT

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed. In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
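
The abstract does not spell out how landmarks work, so the following is only a rough illustration of why they could reduce posting updates. It assumes that a posting stores a word's offset from a nearby landmark instead of its absolute position, and that landmark positions live in a separate directory. The function names, the landmark spacing, and the hard-coded landmark positions after the edit are all hypothetical; in the paper, a diff of the old and new versions would locate the changed region.

```python
def index_absolute(words):
    """Postings as (word, absolute position)."""
    return {(w, i) for i, w in enumerate(words)}


def index_landmark(words, starts):
    """Postings as (word, landmark id, offset from that landmark).

    `starts` holds the absolute start position of each landmark
    (an assumed directory layout, sorted, with starts[0] == 0).
    """
    postings = set()
    for i, w in enumerate(words):
        # Pick the last landmark that starts at or before position i.
        lm = max(k for k, s in enumerate(starts) if s <= i)
        postings.add((w, lm, i - starts[lm]))
    return postings


old = "a b c d e f g h".split()
new = "a X b c d e f g h".split()  # one word inserted at position 1

# Postings that must be deleted or inserted under each scheme.
abs_changed = index_absolute(old) ^ index_absolute(new)

# Assumed landmarks every 4 words in the old version: [0, 4].
# After the insertion only the directory entry for landmark 1 moves (4 -> 5).
lm_changed = index_landmark(old, [0, 4]) ^ index_landmark(new, [0, 5])

print(len(abs_changed), len(lm_changed))
```

In this toy case every posting after the insertion point changes under absolute positions, while under the assumed landmark scheme only the postings in the edited landmark's region change; the words after the second landmark keep their postings.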

PAGES

37-69

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/s11280-006-0009-2

DOI

http://dx.doi.org/10.1007/s11280-006-0009-2

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1005001548



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "IBM Research \u2013 Thomas J. Watson Research Center", 
          "id": "https://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "IBM T. J. Watson Research Ctr., 19 Skyline Dr., 10532, Hawthorne, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Lim", 
        "givenName": "Lipyeow", 
        "id": "sg:person.015270452377.19", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015270452377.19"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "IBM Research \u2013 Thomas J. Watson Research Center", 
          "id": "https://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "IBM T. J. Watson Research Ctr., 19 Skyline Dr., 10532, Hawthorne, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Wang", 
        "givenName": "Min", 
        "id": "sg:person.012657435165.75", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012657435165.75"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "name": [
            "IBM Silicon Valley Lab., 555 Bailey Av., 95141, San Jose, CA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Padmanabhan", 
        "givenName": "Sriram", 
        "id": "sg:person.015175246447.02", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015175246447.02"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Purdue University", 
          "id": "https://www.grid.ac/institutes/grid.169077.e", 
          "name": [
            "Purdue University, 150 N. University St., 47907, West Lafayette, IN, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Vitter", 
        "givenName": "Jeffrey Scott", 
        "id": "sg:person.0613677314.28", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0613677314.28"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "IBM Research - Almaden", 
          "id": "https://www.grid.ac/institutes/grid.481551.c", 
          "name": [
            "IBM Almaden Research Ctr., 650 Harry Rd., 95120-6099, San Jose, CA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Agarwal", 
        "givenName": "Ramesh", 
        "id": "sg:person.013773134040.12", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013773134040.12"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1145/278459.258561", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000251272"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/160688.160693", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1001933844"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0306-4573(94)00052-5", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1003335471"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/3-540-46439-5_29", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1007910711", 
          "https://doi.org/10.1007/3-540-46439-5_29"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1002/(sici)1097-4571(2000)51:1<69::aid-asi10>3.0.co;2-c", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1016019385"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/191839.191896", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1026027367"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/s0019-9958(85)80046-2", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1027712063"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/96749.98245", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1028683478"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/3-540-47714-4_13", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1029669526", 
          "https://doi.org/10.1007/3-540-47714-4_13"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/21987", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1034820983", 
          "https://doi.org/10.1038/21987"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/s0169-7552(98)00110-x", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1035913093"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/371920.372095", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1048178520"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/857166.857170", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1049946035"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/359842.359859", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1051315984"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/2.841784", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061106233"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1137/0206024", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1062841363"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2007-03", 
    "datePublishedReg": "2007-03-01", 
    "description": "Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed. In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1007/s11280-006-0009-2", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1136663", 
        "issn": [
          "1386-145X", 
          "1573-1413"
        ], 
        "name": "World Wide Web", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "10"
      }
    ], 
    "name": "Efficient Update of Indexes for Dynamically Changing Web Documents", 
    "pagination": "37-69", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "17d530c127868ffbb9b78f664e46a6c59061b5df326cf0e8fe61a84e1e28a74a"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/s11280-006-0009-2"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1005001548"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1007/s11280-006-0009-2", 
      "https://app.dimensions.ai/details/publication/pub.1005001548"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-10T14:11", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8660_00000520.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "http://link.springer.com/10.1007%2Fs11280-006-0009-2"
  }
]
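
Because the record is plain JSON, it can be read with any JSON parser. A minimal sketch follows, assuming the payload has the same top-level shape as the record above (a one-element array). The `summarize` helper and the trimmed `sample` string are illustrative and not part of SciGraph.

```python
import json

# Trimmed copy of the record above, keeping only the fields used below.
sample = """
[
  {
    "author": [
      {"familyName": "Lim", "givenName": "Lipyeow", "type": "Person"},
      {"familyName": "Wang", "givenName": "Min", "type": "Person"}
    ],
    "name": "Efficient Update of Indexes for Dynamically Changing Web Documents",
    "pagination": "37-69",
    "type": "ScholarlyArticle"
  }
]
"""


def summarize(payload):
    """Return title, author names, and page range from a SciGraph-style JSON-LD string."""
    record = json.loads(payload)[0]  # the record is the first item of the array
    authors = [
        f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])
    ]
    return {
        "title": record["name"],
        "authors": authors,
        "pages": record.get("pagination"),
    }


print(summarize(sample))
```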
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s11280-006-0009-2'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s11280-006-0009-2'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s11280-006-0009-2'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s11280-006-0009-2'
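
The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The helper names `build_request` and `fetch` and the format keys are assumptions made for this example, and the sketch does not check whether the endpoint still responds.

```python
import urllib.request

BASE = "https://scigraph.springernature.com/"

# Media types taken from the curl commands above; the dictionary keys are arbitrary labels.
FORMATS = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(record_id, fmt):
    """Build a GET request that asks for `record_id` in the given serialization."""
    return urllib.request.Request(
        BASE + record_id, headers={"Accept": FORMATS[fmt]}
    )


def fetch(record_id, fmt, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_request("pub.10.1007/s11280-006-0009-2", "turtle")
print(req.full_url, req.get_header("Accept"))
```

Calling `fetch("pub.10.1007/s11280-006-0009-2", "json-ld")` would correspond to the first curl command.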


 

This table displays all metadata directly associated with this object as RDF triples.

148 TRIPLES      21 PREDICATES      43 URIs      19 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/s11280-006-0009-2 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author N39eda451301b4710b2916d325d4e67b6
4 schema:citation sg:pub.10.1007/3-540-46439-5_29
5 sg:pub.10.1007/3-540-47714-4_13
6 sg:pub.10.1038/21987
7 https://doi.org/10.1002/(sici)1097-4571(2000)51:1<69::aid-asi10>3.0.co;2-c
8 https://doi.org/10.1016/0306-4573(94)00052-5
9 https://doi.org/10.1016/s0019-9958(85)80046-2
10 https://doi.org/10.1016/s0169-7552(98)00110-x
11 https://doi.org/10.1109/2.841784
12 https://doi.org/10.1137/0206024
13 https://doi.org/10.1145/160688.160693
14 https://doi.org/10.1145/191839.191896
15 https://doi.org/10.1145/278459.258561
16 https://doi.org/10.1145/359842.359859
17 https://doi.org/10.1145/371920.372095
18 https://doi.org/10.1145/857166.857170
19 https://doi.org/10.1145/96749.98245
20 schema:datePublished 2007-03
21 schema:datePublishedReg 2007-03-01
22 schema:description Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed. In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
23 schema:genre research_article
24 schema:inLanguage en
25 schema:isAccessibleForFree true
26 schema:isPartOf Nc0a31d713fda4678a60e49ca91d9d12c
27 Nda82616314dc44b08d6ce386709475f3
28 sg:journal.1136663
29 schema:name Efficient Update of Indexes for Dynamically Changing Web Documents
30 schema:pagination 37-69
31 schema:productId N042f1b47a10b4fa7932d76e013759cf5
32 N2e079737628e45b2aeb549156f505264
33 Na35de2ba972c4fcdb0eb115b16f29756
34 schema:sameAs https://app.dimensions.ai/details/publication/pub.1005001548
35 https://doi.org/10.1007/s11280-006-0009-2
36 schema:sdDatePublished 2019-04-10T14:11
37 schema:sdLicense https://scigraph.springernature.com/explorer/license/
38 schema:sdPublisher Na827415e5b5a454397d66b1eb68a9907
39 schema:url http://link.springer.com/10.1007%2Fs11280-006-0009-2
40 sgo:license sg:explorer/license/
41 sgo:sdDataset articles
42 rdf:type schema:ScholarlyArticle
43 N042f1b47a10b4fa7932d76e013759cf5 schema:name doi
44 schema:value 10.1007/s11280-006-0009-2
45 rdf:type schema:PropertyValue
46 N2e079737628e45b2aeb549156f505264 schema:name dimensions_id
47 schema:value pub.1005001548
48 rdf:type schema:PropertyValue
49 N39eda451301b4710b2916d325d4e67b6 rdf:first sg:person.015270452377.19
50 rdf:rest Nc5486863aa9048908789b4e40b91df98
51 N6ebf839dadba4ddca182736d92dff2e8 rdf:first sg:person.013773134040.12
52 rdf:rest rdf:nil
53 N7f654e1b21734045a87d86afd5fb58fb rdf:first sg:person.0613677314.28
54 rdf:rest N6ebf839dadba4ddca182736d92dff2e8
55 N8cc39ade8de449638cc0cb55a4407fcf schema:name IBM Silicon Valley Lab., 555 Bailey Av., 95141, San Jose, CA, USA
56 rdf:type schema:Organization
57 Na35de2ba972c4fcdb0eb115b16f29756 schema:name readcube_id
58 schema:value 17d530c127868ffbb9b78f664e46a6c59061b5df326cf0e8fe61a84e1e28a74a
59 rdf:type schema:PropertyValue
60 Na827415e5b5a454397d66b1eb68a9907 schema:name Springer Nature - SN SciGraph project
61 rdf:type schema:Organization
62 Nc0a31d713fda4678a60e49ca91d9d12c schema:issueNumber 1
63 rdf:type schema:PublicationIssue
64 Nc5486863aa9048908789b4e40b91df98 rdf:first sg:person.012657435165.75
65 rdf:rest Nc8057503e2784373b6ac801dbc9a229d
66 Nc8057503e2784373b6ac801dbc9a229d rdf:first sg:person.015175246447.02
67 rdf:rest N7f654e1b21734045a87d86afd5fb58fb
68 Nda82616314dc44b08d6ce386709475f3 schema:volumeNumber 10
69 rdf:type schema:PublicationVolume
70 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
71 schema:name Information and Computing Sciences
72 rdf:type schema:DefinedTerm
73 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
74 schema:name Artificial Intelligence and Image Processing
75 rdf:type schema:DefinedTerm
76 sg:journal.1136663 schema:issn 1386-145X
77 1573-1413
78 schema:name World Wide Web
79 rdf:type schema:Periodical
80 sg:person.012657435165.75 schema:affiliation https://www.grid.ac/institutes/grid.481554.9
81 schema:familyName Wang
82 schema:givenName Min
83 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012657435165.75
84 rdf:type schema:Person
85 sg:person.013773134040.12 schema:affiliation https://www.grid.ac/institutes/grid.481551.c
86 schema:familyName Agarwal
87 schema:givenName Ramesh
88 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013773134040.12
89 rdf:type schema:Person
90 sg:person.015175246447.02 schema:affiliation N8cc39ade8de449638cc0cb55a4407fcf
91 schema:familyName Padmanabhan
92 schema:givenName Sriram
93 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015175246447.02
94 rdf:type schema:Person
95 sg:person.015270452377.19 schema:affiliation https://www.grid.ac/institutes/grid.481554.9
96 schema:familyName Lim
97 schema:givenName Lipyeow
98 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015270452377.19
99 rdf:type schema:Person
100 sg:person.0613677314.28 schema:affiliation https://www.grid.ac/institutes/grid.169077.e
101 schema:familyName Vitter
102 schema:givenName Jeffrey Scott
103 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0613677314.28
104 rdf:type schema:Person
105 sg:pub.10.1007/3-540-46439-5_29 schema:sameAs https://app.dimensions.ai/details/publication/pub.1007910711
106 https://doi.org/10.1007/3-540-46439-5_29
107 rdf:type schema:CreativeWork
108 sg:pub.10.1007/3-540-47714-4_13 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029669526
109 https://doi.org/10.1007/3-540-47714-4_13
110 rdf:type schema:CreativeWork
111 sg:pub.10.1038/21987 schema:sameAs https://app.dimensions.ai/details/publication/pub.1034820983
112 https://doi.org/10.1038/21987
113 rdf:type schema:CreativeWork
114 https://doi.org/10.1002/(sici)1097-4571(2000)51:1<69::aid-asi10>3.0.co;2-c schema:sameAs https://app.dimensions.ai/details/publication/pub.1016019385
115 rdf:type schema:CreativeWork
116 https://doi.org/10.1016/0306-4573(94)00052-5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1003335471
117 rdf:type schema:CreativeWork
118 https://doi.org/10.1016/s0019-9958(85)80046-2 schema:sameAs https://app.dimensions.ai/details/publication/pub.1027712063
119 rdf:type schema:CreativeWork
120 https://doi.org/10.1016/s0169-7552(98)00110-x schema:sameAs https://app.dimensions.ai/details/publication/pub.1035913093
121 rdf:type schema:CreativeWork
122 https://doi.org/10.1109/2.841784 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061106233
123 rdf:type schema:CreativeWork
124 https://doi.org/10.1137/0206024 schema:sameAs https://app.dimensions.ai/details/publication/pub.1062841363
125 rdf:type schema:CreativeWork
126 https://doi.org/10.1145/160688.160693 schema:sameAs https://app.dimensions.ai/details/publication/pub.1001933844
127 rdf:type schema:CreativeWork
128 https://doi.org/10.1145/191839.191896 schema:sameAs https://app.dimensions.ai/details/publication/pub.1026027367
129 rdf:type schema:CreativeWork
130 https://doi.org/10.1145/278459.258561 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000251272
131 rdf:type schema:CreativeWork
132 https://doi.org/10.1145/359842.359859 schema:sameAs https://app.dimensions.ai/details/publication/pub.1051315984
133 rdf:type schema:CreativeWork
134 https://doi.org/10.1145/371920.372095 schema:sameAs https://app.dimensions.ai/details/publication/pub.1048178520
135 rdf:type schema:CreativeWork
136 https://doi.org/10.1145/857166.857170 schema:sameAs https://app.dimensions.ai/details/publication/pub.1049946035
137 rdf:type schema:CreativeWork
138 https://doi.org/10.1145/96749.98245 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028683478
139 rdf:type schema:CreativeWork
140 https://www.grid.ac/institutes/grid.169077.e schema:alternateName Purdue University
141 schema:name Purdue University, 150 N. University St., 47907, West Lafayette, IN, USA
142 rdf:type schema:Organization
143 https://www.grid.ac/institutes/grid.481551.c schema:alternateName IBM Research - Almaden
144 schema:name IBM Almaden Research Ctr., 650 Harry Rd., 95120-6099, San Jose, CA, USA
145 rdf:type schema:Organization
146 https://www.grid.ac/institutes/grid.481554.9 schema:alternateName IBM Research – Thomas J. Watson Research Center
147 schema:name IBM T. J. Watson Research Ctr., 19 Skyline Dr., 10532, Hawthorne, NY, USA
148 rdf:type schema:Organization
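
In the table, the author order is not stored on the article itself: `schema:author` points to a blank node that heads an RDF list, and each list cell carries an `rdf:first` (the person) and an `rdf:rest` (the next cell, or `rdf:nil` at the end). A small sketch of walking that list, with the cells copied by hand from rows 49-54 and 64-67; the dictionary layout and the `walk_list` helper are assumptions made for this example.

```python
RDF_NIL = "rdf:nil"

# Blank node -> (rdf:first, rdf:rest), transcribed from the triple table.
cells = {
    "N39eda451301b4710b2916d325d4e67b6": ("sg:person.015270452377.19", "Nc5486863aa9048908789b4e40b91df98"),
    "Nc5486863aa9048908789b4e40b91df98": ("sg:person.012657435165.75", "Nc8057503e2784373b6ac801dbc9a229d"),
    "Nc8057503e2784373b6ac801dbc9a229d": ("sg:person.015175246447.02", "N7f654e1b21734045a87d86afd5fb58fb"),
    "N7f654e1b21734045a87d86afd5fb58fb": ("sg:person.0613677314.28", "N6ebf839dadba4ddca182736d92dff2e8"),
    "N6ebf839dadba4ddca182736d92dff2e8": ("sg:person.013773134040.12", RDF_NIL),
}

# schema:familyName for each person, from the same table.
family_name = {
    "sg:person.015270452377.19": "Lim",
    "sg:person.012657435165.75": "Wang",
    "sg:person.015175246447.02": "Padmanabhan",
    "sg:person.0613677314.28": "Vitter",
    "sg:person.013773134040.12": "Agarwal",
}


def walk_list(head):
    """Follow rdf:rest links from `head` and collect each rdf:first value in order."""
    items = []
    node = head
    while node != RDF_NIL:
        first, node = cells[node]
        items.append(first)
    return items


# The object of schema:author in row 3 is the head of the list.
ordered = walk_list("N39eda451301b4710b2916d325d4e67b6")
print([family_name[p] for p in ordered])
```

The resulting order matches the author list at the top of the record.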
 



