Information Retrieval can Cope with Many Errors


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2000-10

AUTHORS

Elke Mittendorf, Peter Schäuble

ABSTRACT

The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.

PAGES

189-216

Identifiers

URI

http://scigraph.springernature.com/pub.10.1023/a:1026564708926

DOI

http://dx.doi.org/10.1023/a:1026564708926

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1012167866



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "Systor AG, CH-8048, Z\u00fcrich, Switzerland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Mittendorf", 
        "givenName": "Elke", 
        "id": "sg:person.010437463675.80", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010437463675.80"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Eurospider Information Technology (Switzerland)", 
          "id": "https://www.grid.ac/institutes/grid.433769.c", 
          "name": [
            "Eurospider Information Technology AG, CH-8006, Z\u00fcrich, Switzerland"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Sch\u00e4uble", 
        "givenName": "Peter", 
        "id": "sg:person.0670254567.14", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0670254567.14"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1093/comjnl/35.3.243", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1003892145"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/190627.190645", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1005540621"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bfb0026737", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1011780554", 
          "https://doi.org/10.1007/bfb0026737"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/978-1-4471-2099-5_21", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017153872", 
          "https://doi.org/10.1007/978-1-4471-2099-5_21"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/1075812.1075897", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1020670887"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bfb0026851", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1025124940", 
          "https://doi.org/10.1007/bfb0026851"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/978-3-663-11499-4", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1031890788", 
          "https://doi.org/10.1007/978-3-663-11499-4"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/978-1-4471-2099-5_24", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1032195050", 
          "https://doi.org/10.1007/978-1-4471-2099-5_24"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/243199.243206", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1032416718"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/243199.243208", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1033593408"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/978-1-4471-2099-5_15", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1035660246", 
          "https://doi.org/10.1007/978-1-4471-2099-5_15"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1108/eb046814", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1037275209"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/215206.215379", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1047055919"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/243199.243202", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1048194338"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2000-10", 
    "datePublishedReg": "2000-10-01", 
    "description": "The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1023/a:1026564708926", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": false, 
    "isPartOf": [
      {
        "id": "sg:journal.1023664", 
        "issn": [
          "1386-4564", 
          "1573-7659"
        ], 
        "name": "Information Retrieval Journal", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "3", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "3"
      }
    ], 
    "name": "Information Retrieval can Cope with Many Errors", 
    "pagination": "189-216", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "39f8a447aeafa48bdb6ed3f9ab5478562216fc49cf38d206e988e559bdcce5c0"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1023/a:1026564708926"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1012167866"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1023/a:1026564708926", 
      "https://app.dimensions.ai/details/publication/pub.1012167866"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-10T17:36", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8672_00000536.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "http://link.springer.com/10.1023%2FA%3A1026564708926"
  }
]
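
Because the record is plain JSON, it can be inspected without an RDF library. The following is a minimal sketch, assuming the array above is available as a string; the helper name `summarize_record` and the trimmed sample are illustrative and not part of SciGraph. Citation ids are de-duplicated defensively, since exported records can repeat an entry.

```python
import json

# Trimmed, illustrative copy of the record (only the fields used below).
SAMPLE = r"""
[
  {
    "id": "sg:pub.10.1023/a:1026564708926",
    "name": "Information Retrieval can Cope with Many Errors",
    "author": [
      {"familyName": "Mittendorf", "givenName": "Elke"},
      {"familyName": "Sch\u00e4uble", "givenName": "Peter"}
    ],
    "citation": [
      {"id": "https://doi.org/10.1093/comjnl/35.3.243"},
      {"id": "sg:pub.10.1007/978-3-663-11499-4"},
      {"id": "sg:pub.10.1007/978-3-663-11499-4"}
    ],
    "pagination": "189-216"
  }
]
"""


def summarize_record(text):
    """Return title, author names, and distinct citation ids of the first record."""
    record = json.loads(text)[0]
    authors = [
        f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])
    ]
    # dict.fromkeys keeps first-seen order while dropping repeated ids.
    citations = list(dict.fromkeys(c["id"] for c in record.get("citation", [])))
    return {"title": record["name"], "authors": authors, "citations": citations}


summary = summarize_record(SAMPLE)
print(summary["title"])
print(", ".join(summary["authors"]))
print(len(summary["citations"]), "distinct citations")
```

The same function should work on the full record, as long as the top-level value is an array whose first element is the article.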
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1026564708926'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1026564708926'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1026564708926'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1026564708926'
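
The four curl calls differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is a sketch under assumptions: the names `ACCEPT`, `build_request`, and `fetch` are illustrative, and whether the endpoint still answers is not something the code checks. `fetch` is defined but not called here, so the block runs without network access.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Short format name -> media type, taken from the curl commands above.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id, fmt="json-ld"):
    """Build a GET request that asks for the record in the given RDF format."""
    return urllib.request.Request(
        BASE_URL + record_id, headers={"Accept": ACCEPT[fmt]}
    )


def fetch(record_id, fmt="json-ld", timeout=30):
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_request("pub.10.1023/a:1026564708926", "turtle")
print(req.full_url)
print(req.get_header("Accept"))
```

Calling `fetch("pub.10.1023/a:1026564708926", "nt")` would correspond to the N-Triples curl command.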


 

This table displays all metadata directly associated with this object as RDF triples.

118 TRIPLES      21 PREDICATES      41 URIs      19 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1023/a:1026564708926 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author Ne2956124de784c6382d0d953b3132eef
4 schema:citation sg:pub.10.1007/978-1-4471-2099-5_15
5 sg:pub.10.1007/978-1-4471-2099-5_21
6 sg:pub.10.1007/978-1-4471-2099-5_24
7 sg:pub.10.1007/978-3-663-11499-4
8 sg:pub.10.1007/bfb0026737
9 sg:pub.10.1007/bfb0026851
10 https://doi.org/10.1093/comjnl/35.3.243
11 https://doi.org/10.1108/eb046814
12 https://doi.org/10.1145/190627.190645
13 https://doi.org/10.1145/215206.215379
14 https://doi.org/10.1145/243199.243202
15 https://doi.org/10.1145/243199.243206
16 https://doi.org/10.1145/243199.243208
17 https://doi.org/10.3115/1075812.1075897
18 schema:datePublished 2000-10
19 schema:datePublishedReg 2000-10-01
20 schema:description The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.
21 schema:genre research_article
22 schema:inLanguage en
23 schema:isAccessibleForFree false
24 schema:isPartOf Ndb6c31b3f7c24a9cae5debadecee332e
25 Ndb6deb16f1d7417f82ad94c957bb9594
26 sg:journal.1023664
27 schema:name Information Retrieval can Cope with Many Errors
28 schema:pagination 189-216
29 schema:productId N2de7fb47ea054a2ebcd9bcb9b4f06ef3
30 N6a3acaeabf9747d4849a053d4a3e57e5
31 N903020cc97ac44d598491d0cef6cad06
32 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012167866
33 https://doi.org/10.1023/a:1026564708926
34 schema:sdDatePublished 2019-04-10T17:36
35 schema:sdLicense https://scigraph.springernature.com/explorer/license/
36 schema:sdPublisher Ne2aba9e40218482083f636873d589c30
37 schema:url http://link.springer.com/10.1023%2FA%3A1026564708926
38 sgo:license sg:explorer/license/
39 sgo:sdDataset articles
40 rdf:type schema:ScholarlyArticle
41 N2d5ce1235a9e4326a972b648a031a808 schema:name Systor AG, CH-8048, Zürich, Switzerland
42 rdf:type schema:Organization
43 N2de7fb47ea054a2ebcd9bcb9b4f06ef3 schema:name dimensions_id
44 schema:value pub.1012167866
45 rdf:type schema:PropertyValue
46 N6a3acaeabf9747d4849a053d4a3e57e5 schema:name doi
47 schema:value 10.1023/a:1026564708926
48 rdf:type schema:PropertyValue
49 N82d63c8b725b47789fc831ba7ab6a6be rdf:first sg:person.0670254567.14
50 rdf:rest rdf:nil
51 N903020cc97ac44d598491d0cef6cad06 schema:name readcube_id
52 schema:value 39f8a447aeafa48bdb6ed3f9ab5478562216fc49cf38d206e988e559bdcce5c0
53 rdf:type schema:PropertyValue
54 Ndb6c31b3f7c24a9cae5debadecee332e schema:volumeNumber 3
55 rdf:type schema:PublicationVolume
56 Ndb6deb16f1d7417f82ad94c957bb9594 schema:issueNumber 3
57 rdf:type schema:PublicationIssue
58 Ne2956124de784c6382d0d953b3132eef rdf:first sg:person.010437463675.80
59 rdf:rest N82d63c8b725b47789fc831ba7ab6a6be
60 Ne2aba9e40218482083f636873d589c30 schema:name Springer Nature - SN SciGraph project
61 rdf:type schema:Organization
62 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
63 schema:name Information and Computing Sciences
64 rdf:type schema:DefinedTerm
65 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
66 schema:name Artificial Intelligence and Image Processing
67 rdf:type schema:DefinedTerm
68 sg:journal.1023664 schema:issn 1386-4564
69 1573-7659
70 schema:name Information Retrieval Journal
71 rdf:type schema:Periodical
72 sg:person.010437463675.80 schema:affiliation N2d5ce1235a9e4326a972b648a031a808
73 schema:familyName Mittendorf
74 schema:givenName Elke
75 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010437463675.80
76 rdf:type schema:Person
77 sg:person.0670254567.14 schema:affiliation https://www.grid.ac/institutes/grid.433769.c
78 schema:familyName Schäuble
79 schema:givenName Peter
80 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0670254567.14
81 rdf:type schema:Person
82 sg:pub.10.1007/978-1-4471-2099-5_15 schema:sameAs https://app.dimensions.ai/details/publication/pub.1035660246
83 https://doi.org/10.1007/978-1-4471-2099-5_15
84 rdf:type schema:CreativeWork
85 sg:pub.10.1007/978-1-4471-2099-5_21 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017153872
86 https://doi.org/10.1007/978-1-4471-2099-5_21
87 rdf:type schema:CreativeWork
88 sg:pub.10.1007/978-1-4471-2099-5_24 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032195050
89 https://doi.org/10.1007/978-1-4471-2099-5_24
90 rdf:type schema:CreativeWork
91 sg:pub.10.1007/978-3-663-11499-4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1031890788
92 https://doi.org/10.1007/978-3-663-11499-4
93 rdf:type schema:CreativeWork
94 sg:pub.10.1007/bfb0026737 schema:sameAs https://app.dimensions.ai/details/publication/pub.1011780554
95 https://doi.org/10.1007/bfb0026737
96 rdf:type schema:CreativeWork
97 sg:pub.10.1007/bfb0026851 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025124940
98 https://doi.org/10.1007/bfb0026851
99 rdf:type schema:CreativeWork
100 https://doi.org/10.1093/comjnl/35.3.243 schema:sameAs https://app.dimensions.ai/details/publication/pub.1003892145
101 rdf:type schema:CreativeWork
102 https://doi.org/10.1108/eb046814 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037275209
103 rdf:type schema:CreativeWork
104 https://doi.org/10.1145/190627.190645 schema:sameAs https://app.dimensions.ai/details/publication/pub.1005540621
105 rdf:type schema:CreativeWork
106 https://doi.org/10.1145/215206.215379 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047055919
107 rdf:type schema:CreativeWork
108 https://doi.org/10.1145/243199.243202 schema:sameAs https://app.dimensions.ai/details/publication/pub.1048194338
109 rdf:type schema:CreativeWork
110 https://doi.org/10.1145/243199.243206 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032416718
111 rdf:type schema:CreativeWork
112 https://doi.org/10.1145/243199.243208 schema:sameAs https://app.dimensions.ai/details/publication/pub.1033593408
113 rdf:type schema:CreativeWork
114 https://doi.org/10.3115/1075812.1075897 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020670887
115 rdf:type schema:CreativeWork
116 https://www.grid.ac/institutes/grid.433769.c schema:alternateName Eurospider Information Technology (Switzerland)
117 schema:name Eurospider Information Technology AG, CH-8006, Zürich, Switzerland
118 rdf:type schema:Organization
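
In the table, author order is encoded as an RDF collection: the blank node given as the object of schema:author holds the first author via rdf:first and points to the next cell via rdf:rest, until rdf:nil ends the list. A minimal sketch of walking such a list, assuming the relevant triples are held as plain Python tuples (the function name `walk_rdf_list` is illustrative, and the cycle check is a defensive addition):

```python
# The five triples from the table that describe the author list.
TRIPLES = [
    ("sg:pub.10.1023/a:1026564708926", "schema:author", "Ne2956124de784c6382d0d953b3132eef"),
    ("Ne2956124de784c6382d0d953b3132eef", "rdf:first", "sg:person.010437463675.80"),
    ("Ne2956124de784c6382d0d953b3132eef", "rdf:rest", "N82d63c8b725b47789fc831ba7ab6a6be"),
    ("N82d63c8b725b47789fc831ba7ab6a6be", "rdf:first", "sg:person.0670254567.14"),
    ("N82d63c8b725b47789fc831ba7ab6a6be", "rdf:rest", "rdf:nil"),
]


def walk_rdf_list(triples, head):
    """Follow rdf:first / rdf:rest from `head` and return the items in order."""
    index = {(s, p): o for s, p, o in triples}
    items = []
    seen = set()
    node = head
    while node != "rdf:nil":
        if node in seen:
            raise ValueError(f"cycle in RDF list at {node}")
        seen.add(node)
        items.append(index[(node, "rdf:first")])
        node = index[(node, "rdf:rest")]
    return items


# The list head is the object of the schema:author triple.
head = next(o for s, p, o in TRIPLES if p == "schema:author")
authors = walk_rdf_list(TRIPLES, head)
print(authors)
```

The resulting identifiers can then be looked up in the schema:givenName and schema:familyName rows to recover the author names in order.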
 



