Near-duplicate document detection with improved similarity measurement


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2012-08

AUTHORS

Xin-pan Yuan, Jun Long, Zu-ping Zhang, Wei-hua Gui

ABSTRACT

To quickly find documents with high similarity in existing documentation sets, a fingerprint group merging retrieval algorithm is proposed to address both sides of the problem: the given similarity threshold cannot be too low, and fewer fingerprints can lead to low accuracy. It can be proved that the fingerprint group merging retrieval algorithm improves the efficiency of similarity retrieval at a lower similarity threshold. Experiments with the lower similarity threshold r=0.7 and a high number of fingerprint bits k=400 demonstrate that the CPU time cost decreases from 1 921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of this method.

PAGES

2231-2237

References to SciGraph publications

  • 2009. Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems in STRING PROCESSING AND INFORMATION RETRIEVAL
  • 2002-11-07. Identifying and Filtering Near-Duplicate Documents in COMBINATORIAL PATTERN MATCHING
  • 2010. Fingerprinting Ratings for Collaborative Filtering — Theoretical and Empirical Analysis in STRING PROCESSING AND INFORMATION RETRIEVAL
Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s11771-012-1267-z

    DOI

    http://dx.doi.org/10.1007/s11771-012-1267-z

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1047178375



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Central South University", 
              "id": "https://www.grid.ac/institutes/grid.216417.7", 
              "name": [
                "School of Information Science and Engineering, Central South University, 410083, Changsha, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Yuan", 
            "givenName": "Xin-pan", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Central South University", 
              "id": "https://www.grid.ac/institutes/grid.216417.7", 
              "name": [
                "School of Information Science and Engineering, Central South University, 410083, Changsha, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Long", 
            "givenName": "Jun", 
            "id": "sg:person.01164362421.69", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01164362421.69"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Central South University", 
              "id": "https://www.grid.ac/institutes/grid.216417.7", 
              "name": [
                "School of Information Science and Engineering, Central South University, 410083, Changsha, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Zhang", 
            "givenName": "Zu-ping", 
            "id": "sg:person.0622141222.51", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0622141222.51"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Central South University", 
              "id": "https://www.grid.ac/institutes/grid.216417.7", 
              "name": [
                "School of Information Science and Engineering, Central South University, 410083, Changsha, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Gui", 
            "givenName": "Wei-hua", 
            "id": "sg:person.014564074022.19", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014564074022.19"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1145/1772690.1772759", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1003368044"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1006/jcss.1999.1690", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1007368113"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1513876.1513879", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1007549781"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45123-4_1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1008158100", 
              "https://doi.org/10.1007/3-540-45123-4_1"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1006/jagm.2000.1131", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1012942189"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1978542.1978566", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1015038664"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1498759.1498835", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1024311375"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.comcom.2008.01.001", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1028417316"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1242572.1242592", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1029933993"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/509907.509965", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030416589"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-03784-9_34", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037752319", 
              "https://doi.org/10.1007/978-3-642-03784-9_34"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1341531.1341547", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1038823884"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-16321-0_3", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1046500355", 
              "https://doi.org/10.1007/978-3-642-16321-0_3"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1137/1.9781611973082.5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1088801370"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/sequen.1997.666900", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1095535976"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2012-08", 
        "datePublishedReg": "2012-08-01", 
        "description": "To quickly find documents with high similarity in existing documentation sets, fingerprint group merging retrieval algorithm is proposed to address both sides of the problem: a given similarity threshold could not be too low and fewer fingerprints could lead to low accuracy. It can be proved that the efficiency of similarity retrieval is improved by fingerprint group merging retrieval algorithm with lower similarity threshold. Experiments with the lower similarity threshold r=0.7 and high fingerprint bits k=400 demonstrate that the CPU time-consuming cost decreases from 1 921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of this method.", 
        "genre": "research_article", 
        "id": "sg:pub.10.1007/s11771-012-1267-z", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": [
          {
            "id": "sg:journal.1135866", 
            "issn": [
              "2095-2899", 
              "2227-5223"
            ], 
            "name": "Journal of Central South University", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "8", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "19"
          }
        ], 
        "name": "Near-duplicate document detection with improved similarity measurement", 
        "pagination": "2231-2237", 
        "productId": [
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "e0bdb81d5658c3c8c945a197b0996a11df4c6693cb79f7e5dfbace519059159a"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s11771-012-1267-z"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1047178375"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s11771-012-1267-z", 
          "https://app.dimensions.ai/details/publication/pub.1047178375"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2019-04-10T20:50", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8684_00000524.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "http://link.springer.com/10.1007%2Fs11771-012-1267-z"
      }
    ]
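    Because the record is plain JSON, it can be read with any JSON parser. The following is a minimal sketch, assuming a shortened excerpt of the record above (two of the four authors, three citation entries); the helper names `author_names` and `unique_citation_ids` are hypothetical and not part of SciGraph. The excerpt deliberately repeats one citation id to show a de-duplication step.

```python
import json

# Shortened excerpt of the record; field names and values are copied from it.
RECORD = json.loads("""
[
  {
    "author": [
      {"familyName": "Yuan", "givenName": "Xin-pan", "type": "Person"},
      {"familyName": "Long", "givenName": "Jun",
       "id": "sg:person.01164362421.69", "type": "Person"}
    ],
    "citation": [
      {"id": "sg:pub.10.1007/3-540-45123-4_1", "type": "CreativeWork"},
      {"id": "sg:pub.10.1007/3-540-45123-4_1", "type": "CreativeWork"},
      {"id": "https://doi.org/10.1006/jagm.2000.1131", "type": "CreativeWork"}
    ],
    "name": "Near-duplicate document detection with improved similarity measurement",
    "pagination": "2231-2237"
  }
]
""")


def author_names(record):
    """Return 'given family' strings in the order the record lists them."""
    return [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]


def unique_citation_ids(record):
    """Return citation ids with repeats dropped, keeping first-seen order."""
    return list(dict.fromkeys(c["id"] for c in record.get("citation", [])))


article = RECORD[0]  # the top-level value is a one-element array
print(author_names(article))
print(unique_citation_ids(article))
```

    The top-level value is an array, so the article object is its first element. `dict.fromkeys` is used here only because it preserves insertion order while discarding repeated keys.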
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s11771-012-1267-z'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s11771-012-1267-z'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s11771-012-1267-z'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s11771-012-1267-z'
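
    The four curl commands differ only in the Accept header. As a rough Python equivalent, the sketch below builds the same requests with the standard library. The URL and media types are taken from the commands above; the `ACCEPT` table and the `build_request` helper are hypothetical names introduced for illustration. The sketch only constructs the requests, since whether the endpoint still responds is not something this page confirms.

```python
from urllib.request import Request

# URL copied from the curl commands above.
BASE_URL = "https://scigraph.springernature.com/pub.10.1007/s11771-012-1267-z"

# Media types copied from the curl commands; the short keys are illustrative.
ACCEPT = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE_URL):
    """Build a GET request asking the server for one RDF serialization."""
    return Request(url, headers={"Accept": ACCEPT[fmt]})


for fmt in ACCEPT:
    req = build_request(fmt)
    print(fmt, "->", req.get_header("Accept"))
```

    To actually download a serialization, the request would be passed to `urllib.request.urlopen(req)` and the response body read from the returned object.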


     

    This table displays all metadata directly associated with this object as RDF triples.

    129 TRIPLES      21 PREDICATES      42 URIs      19 LITERALS      7 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s11771-012-1267-z schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N68318d84dda34988b4fcce4fe05d48ae
    4 schema:citation sg:pub.10.1007/3-540-45123-4_1
    5 sg:pub.10.1007/978-3-642-03784-9_34
    6 sg:pub.10.1007/978-3-642-16321-0_3
    7 https://doi.org/10.1006/jagm.2000.1131
    8 https://doi.org/10.1006/jcss.1999.1690
    9 https://doi.org/10.1016/j.comcom.2008.01.001
    10 https://doi.org/10.1109/sequen.1997.666900
    11 https://doi.org/10.1137/1.9781611973082.5
    12 https://doi.org/10.1145/1242572.1242592
    13 https://doi.org/10.1145/1341531.1341547
    14 https://doi.org/10.1145/1498759.1498835
    15 https://doi.org/10.1145/1513876.1513879
    16 https://doi.org/10.1145/1772690.1772759
    17 https://doi.org/10.1145/1978542.1978566
    18 https://doi.org/10.1145/509907.509965
    19 schema:datePublished 2012-08
    20 schema:datePublishedReg 2012-08-01
    21 schema:description To quickly find documents with high similarity in existing documentation sets, fingerprint group merging retrieval algorithm is proposed to address both sides of the problem: a given similarity threshold could not be too low and fewer fingerprints could lead to low accuracy. It can be proved that the efficiency of similarity retrieval is improved by fingerprint group merging retrieval algorithm with lower similarity threshold. Experiments with the lower similarity threshold r=0.7 and high fingerprint bits k=400 demonstrate that the CPU time-consuming cost decreases from 1 921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of this method.
    22 schema:genre research_article
    23 schema:inLanguage en
    24 schema:isAccessibleForFree false
    25 schema:isPartOf N7fc7ffd4e1eb4ce2abfc461fd4833c6f
    26 N83805bd938a0417db3718ead228e163e
    27 sg:journal.1135866
    28 schema:name Near-duplicate document detection with improved similarity measurement
    29 schema:pagination 2231-2237
    30 schema:productId N03cce43e3c39422d858b80eaf4548b8b
    31 N1f329969ce3441a39e314cf64a1d6d2b
    32 Nceea7c4a43c5471db14bd094df2f9943
    33 schema:sameAs https://app.dimensions.ai/details/publication/pub.1047178375
    34 https://doi.org/10.1007/s11771-012-1267-z
    35 schema:sdDatePublished 2019-04-10T20:50
    36 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    37 schema:sdPublisher Ne627e3ca0f954b74ae873b7eba9627d1
    38 schema:url http://link.springer.com/10.1007%2Fs11771-012-1267-z
    39 sgo:license sg:explorer/license/
    40 sgo:sdDataset articles
    41 rdf:type schema:ScholarlyArticle
    42 N03cce43e3c39422d858b80eaf4548b8b schema:name dimensions_id
    43 schema:value pub.1047178375
    44 rdf:type schema:PropertyValue
    45 N1f329969ce3441a39e314cf64a1d6d2b schema:name doi
    46 schema:value 10.1007/s11771-012-1267-z
    47 rdf:type schema:PropertyValue
    48 N28ec1303b07e462092367ea082c9cd65 rdf:first sg:person.0622141222.51
    49 rdf:rest Nb3e86a892e9d4056a26f159062ed822e
    50 N30575c2d52a844a6b1d07bd556af630e schema:affiliation https://www.grid.ac/institutes/grid.216417.7
    51 schema:familyName Yuan
    52 schema:givenName Xin-pan
    53 rdf:type schema:Person
    54 N68318d84dda34988b4fcce4fe05d48ae rdf:first N30575c2d52a844a6b1d07bd556af630e
    55 rdf:rest N83332f1931fb45a4981c5a29dd1611c3
    56 N7fc7ffd4e1eb4ce2abfc461fd4833c6f schema:volumeNumber 19
    57 rdf:type schema:PublicationVolume
    58 N83332f1931fb45a4981c5a29dd1611c3 rdf:first sg:person.01164362421.69
    59 rdf:rest N28ec1303b07e462092367ea082c9cd65
    60 N83805bd938a0417db3718ead228e163e schema:issueNumber 8
    61 rdf:type schema:PublicationIssue
    62 Nb3e86a892e9d4056a26f159062ed822e rdf:first sg:person.014564074022.19
    63 rdf:rest rdf:nil
    64 Nceea7c4a43c5471db14bd094df2f9943 schema:name readcube_id
    65 schema:value e0bdb81d5658c3c8c945a197b0996a11df4c6693cb79f7e5dfbace519059159a
    66 rdf:type schema:PropertyValue
    67 Ne627e3ca0f954b74ae873b7eba9627d1 schema:name Springer Nature - SN SciGraph project
    68 rdf:type schema:Organization
    69 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    70 schema:name Information and Computing Sciences
    71 rdf:type schema:DefinedTerm
    72 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    73 schema:name Artificial Intelligence and Image Processing
    74 rdf:type schema:DefinedTerm
    75 sg:journal.1135866 schema:issn 2095-2899
    76 2227-5223
    77 schema:name Journal of Central South University
    78 rdf:type schema:Periodical
    79 sg:person.01164362421.69 schema:affiliation https://www.grid.ac/institutes/grid.216417.7
    80 schema:familyName Long
    81 schema:givenName Jun
    82 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01164362421.69
    83 rdf:type schema:Person
    84 sg:person.014564074022.19 schema:affiliation https://www.grid.ac/institutes/grid.216417.7
    85 schema:familyName Gui
    86 schema:givenName Wei-hua
    87 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014564074022.19
    88 rdf:type schema:Person
    89 sg:person.0622141222.51 schema:affiliation https://www.grid.ac/institutes/grid.216417.7
    90 schema:familyName Zhang
    91 schema:givenName Zu-ping
    92 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0622141222.51
    93 rdf:type schema:Person
    94 sg:pub.10.1007/3-540-45123-4_1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008158100
    95 https://doi.org/10.1007/3-540-45123-4_1
    96 rdf:type schema:CreativeWork
    97 sg:pub.10.1007/978-3-642-03784-9_34 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037752319
    98 https://doi.org/10.1007/978-3-642-03784-9_34
    99 rdf:type schema:CreativeWork
    100 sg:pub.10.1007/978-3-642-16321-0_3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046500355
    101 https://doi.org/10.1007/978-3-642-16321-0_3
    102 rdf:type schema:CreativeWork
    103 https://doi.org/10.1006/jagm.2000.1131 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012942189
    104 rdf:type schema:CreativeWork
    105 https://doi.org/10.1006/jcss.1999.1690 schema:sameAs https://app.dimensions.ai/details/publication/pub.1007368113
    106 rdf:type schema:CreativeWork
    107 https://doi.org/10.1016/j.comcom.2008.01.001 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028417316
    108 rdf:type schema:CreativeWork
    109 https://doi.org/10.1109/sequen.1997.666900 schema:sameAs https://app.dimensions.ai/details/publication/pub.1095535976
    110 rdf:type schema:CreativeWork
    111 https://doi.org/10.1137/1.9781611973082.5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1088801370
    112 rdf:type schema:CreativeWork
    113 https://doi.org/10.1145/1242572.1242592 schema:sameAs https://app.dimensions.ai/details/publication/pub.1029933993
    114 rdf:type schema:CreativeWork
    115 https://doi.org/10.1145/1341531.1341547 schema:sameAs https://app.dimensions.ai/details/publication/pub.1038823884
    116 rdf:type schema:CreativeWork
    117 https://doi.org/10.1145/1498759.1498835 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024311375
    118 rdf:type schema:CreativeWork
    119 https://doi.org/10.1145/1513876.1513879 schema:sameAs https://app.dimensions.ai/details/publication/pub.1007549781
    120 rdf:type schema:CreativeWork
    121 https://doi.org/10.1145/1772690.1772759 schema:sameAs https://app.dimensions.ai/details/publication/pub.1003368044
    122 rdf:type schema:CreativeWork
    123 https://doi.org/10.1145/1978542.1978566 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015038664
    124 rdf:type schema:CreativeWork
    125 https://doi.org/10.1145/509907.509965 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030416589
    126 rdf:type schema:CreativeWork
    127 https://www.grid.ac/institutes/grid.216417.7 schema:alternateName Central South University
    128 schema:name School of Information Science and Engineering, Central South University, 410083, Changsha, China
    129 rdf:type schema:Organization
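
    In the table, the author order is not stored as an array but as an RDF collection: schema:author points at a list cell (row 3), and each cell carries an rdf:first item and an rdf:rest pointer to the next cell, ending at rdf:nil (rows 48-49, 54-55, 58-59, 62-63). The following sketch walks that chain to recover the author order. The node ids and family names are copied from the table; the dictionaries and the `walk_collection` helper are hypothetical names used only for illustration.

```python
# List cells copied from the table: cell id -> (rdf:first, rdf:rest).
CELLS = {
    "N68318d84dda34988b4fcce4fe05d48ae": (
        "N30575c2d52a844a6b1d07bd556af630e", "N83332f1931fb45a4981c5a29dd1611c3"),
    "N83332f1931fb45a4981c5a29dd1611c3": (
        "sg:person.01164362421.69", "N28ec1303b07e462092367ea082c9cd65"),
    "N28ec1303b07e462092367ea082c9cd65": (
        "sg:person.0622141222.51", "Nb3e86a892e9d4056a26f159062ed822e"),
    "Nb3e86a892e9d4056a26f159062ed822e": (
        "sg:person.014564074022.19", "rdf:nil"),
}

# schema:familyName values copied from the table (rows 51, 80, 85, 90).
FAMILY_NAME = {
    "N30575c2d52a844a6b1d07bd556af630e": "Yuan",
    "sg:person.01164362421.69": "Long",
    "sg:person.0622141222.51": "Zhang",
    "sg:person.014564074022.19": "Gui",
}

# Object of schema:author in row 3 of the table.
HEAD = "N68318d84dda34988b4fcce4fe05d48ae"


def walk_collection(head, cells):
    """Follow rdf:rest pointers from head, collecting each rdf:first item."""
    items, node, seen = [], head, set()
    while node != "rdf:nil":
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError(f"cycle at {node}")
        seen.add(node)
        first, rest = cells[node]
        items.append(first)
        node = rest
    return items


authors = [FAMILY_NAME[n] for n in walk_collection(HEAD, CELLS)]
print(authors)
```

    The recovered order matches the AUTHORS line at the top of the record and the order of the "author" array in the JSON-LD.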
     



