BIRCH: A New Data Clustering Algorithm and Its Applications View Full Text


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

1997-06

AUTHORS

Tian Zhang, Raghu Ramakrishnan, Miron Livny

ABSTRACT

Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression. More... »

PAGES

141-182

References to SciGraph publications

  • 1981. Clustering Analysis and Its Applications in ADVANCES IN INFORMATION SYSTEMS SCIENCE
  • Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1023/a:1009783824328

    DOI

    http://dx.doi.org/10.1023/a:1009783824328

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1000292060


    Indexing Status Check whether this publication has been indexed by Scopus and Web Of Science using the SN Indexing Status Tool
    Incoming Citations Browse incoming citations for this publication using opencitations.net

    JSON-LD is the canonical representation for SciGraph data.

    TIP: You can open this SciGraph record using an external JSON-LD service: JSON-LD Playground Google SDTT

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Wisconsin\u2013Madison", 
              "id": "https://www.grid.ac/institutes/grid.14003.36", 
              "name": [
                "Computer Sciences Department, University of Wisconsin, 53706, Madison, WI, U.S.A."
              ], 
              "type": "Organization"
            }, 
            "familyName": "Zhang", 
            "givenName": "Tian", 
            "id": "sg:person.07415234224.37", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.07415234224.37"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Wisconsin\u2013Madison", 
              "id": "https://www.grid.ac/institutes/grid.14003.36", 
              "name": [
                "Computer Sciences Department, University of Wisconsin, 53706, Madison, WI, U.S.A."
              ], 
              "type": "Organization"
            }, 
            "familyName": "Ramakrishnan", 
            "givenName": "Raghu", 
            "id": "sg:person.011162165263.77", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011162165263.77"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Wisconsin\u2013Madison", 
              "id": "https://www.grid.ac/institutes/grid.14003.36", 
              "name": [
                "Computer Sciences Department, University of Wisconsin, 53706, Madison, WI, U.S.A."
              ], 
              "type": "Organization"
            }, 
            "familyName": "Livny", 
            "givenName": "Miron", 
            "id": "sg:person.012714507463.85", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012714507463.85"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1016/0004-3702(89)90046-5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1004978830"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/0004-3702(89)90046-5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1004978830"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1093/comjnl/26.4.354", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009675711"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1207/s15516709cog0804_1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1022286223"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1207/s15516709cog0804_1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1022286223"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/602259.602266", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1042817816"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-1-4613-9883-7_4", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1049814792", 
              "https://doi.org/10.1007/978-1-4613-9883-7_4"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/83.148613", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1061239030"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1613/jair.276", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1105538420"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "1997-06", 
        "datePublishedReg": "1997-06-01", 
        "description": "Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.", 
        "genre": "research_article", 
        "id": "sg:pub.10.1023/a:1009783824328", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": [
          {
            "id": "sg:journal.1041853", 
            "issn": [
              "1384-5810", 
              "1573-756X"
            ], 
            "name": "Data Mining and Knowledge Discovery", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "2", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "1"
          }
        ], 
        "name": "BIRCH: A New Data Clustering Algorithm and Its Applications", 
        "pagination": "141-182", 
        "productId": [
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "016373ed4ff2d82428dce09e1e3668acbf7824133236754c6a630c3f2753cff7"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1023/a:1009783824328"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1000292060"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1023/a:1009783824328", 
          "https://app.dimensions.ai/details/publication/pub.1000292060"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2019-04-10T19:13", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8678_00000536.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "http://link.springer.com/10.1023%2FA%3A1009783824328"
      }
    ]
     

    Download the RDF metadata as:  json-ld nt turtle xml License info

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1009783824328'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1009783824328'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1009783824328'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1009783824328'


     

    This table displays all metadata directly associated to this object as RDF triples.

    97 TRIPLES      21 PREDICATES      34 URIs      19 LITERALS      7 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1023/a:1009783824328 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author Nfd620238fa224caa9ff722550e09398f
    4 schema:citation sg:pub.10.1007/978-1-4613-9883-7_4
    5 https://doi.org/10.1016/0004-3702(89)90046-5
    6 https://doi.org/10.1093/comjnl/26.4.354
    7 https://doi.org/10.1109/83.148613
    8 https://doi.org/10.1145/602259.602266
    9 https://doi.org/10.1207/s15516709cog0804_1
    10 https://doi.org/10.1613/jair.276
    11 schema:datePublished 1997-06
    12 schema:datePublishedReg 1997-06-01
    13 schema:description Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
    14 schema:genre research_article
    15 schema:inLanguage en
    16 schema:isAccessibleForFree false
    17 schema:isPartOf N21388b4f85b5437db1cae1da278a6bb3
    18 N8b902d5e861f4f13b0717dbbe207fd19
    19 sg:journal.1041853
    20 schema:name BIRCH: A New Data Clustering Algorithm and Its Applications
    21 schema:pagination 141-182
    22 schema:productId N2f7e6ad455e645d8bab546378a468d4e
    23 Na01339e00f1e44488438e447bd261d4a
    24 Nf49cb053aea945da9aa4aee1072cd846
    25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000292060
    26 https://doi.org/10.1023/a:1009783824328
    27 schema:sdDatePublished 2019-04-10T19:13
    28 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    29 schema:sdPublisher Nadc67eba62f1415db99c636d0a06784b
    30 schema:url http://link.springer.com/10.1023%2FA%3A1009783824328
    31 sgo:license sg:explorer/license/
    32 sgo:sdDataset articles
    33 rdf:type schema:ScholarlyArticle
    34 N21388b4f85b5437db1cae1da278a6bb3 schema:issueNumber 2
    35 rdf:type schema:PublicationIssue
    36 N2f7e6ad455e645d8bab546378a468d4e schema:name dimensions_id
    37 schema:value pub.1000292060
    38 rdf:type schema:PropertyValue
    39 N51cfe659cf9b40299e4dc157c99a86cd rdf:first sg:person.011162165263.77
    40 rdf:rest Nb2fcdbeea27042da8508d6e415a71efd
    41 N8b902d5e861f4f13b0717dbbe207fd19 schema:volumeNumber 1
    42 rdf:type schema:PublicationVolume
    43 Na01339e00f1e44488438e447bd261d4a schema:name doi
    44 schema:value 10.1023/a:1009783824328
    45 rdf:type schema:PropertyValue
    46 Nadc67eba62f1415db99c636d0a06784b schema:name Springer Nature - SN SciGraph project
    47 rdf:type schema:Organization
    48 Nb2fcdbeea27042da8508d6e415a71efd rdf:first sg:person.012714507463.85
    49 rdf:rest rdf:nil
    50 Nf49cb053aea945da9aa4aee1072cd846 schema:name readcube_id
    51 schema:value 016373ed4ff2d82428dce09e1e3668acbf7824133236754c6a630c3f2753cff7
    52 rdf:type schema:PropertyValue
    53 Nfd620238fa224caa9ff722550e09398f rdf:first sg:person.07415234224.37
    54 rdf:rest N51cfe659cf9b40299e4dc157c99a86cd
    55 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    56 schema:name Information and Computing Sciences
    57 rdf:type schema:DefinedTerm
    58 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    59 schema:name Artificial Intelligence and Image Processing
    60 rdf:type schema:DefinedTerm
    61 sg:journal.1041853 schema:issn 1384-5810
    62 1573-756X
    63 schema:name Data Mining and Knowledge Discovery
    64 rdf:type schema:Periodical
    65 sg:person.011162165263.77 schema:affiliation https://www.grid.ac/institutes/grid.14003.36
    66 schema:familyName Ramakrishnan
    67 schema:givenName Raghu
    68 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011162165263.77
    69 rdf:type schema:Person
    70 sg:person.012714507463.85 schema:affiliation https://www.grid.ac/institutes/grid.14003.36
    71 schema:familyName Livny
    72 schema:givenName Miron
    73 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012714507463.85
    74 rdf:type schema:Person
    75 sg:person.07415234224.37 schema:affiliation https://www.grid.ac/institutes/grid.14003.36
    76 schema:familyName Zhang
    77 schema:givenName Tian
    78 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.07415234224.37
    79 rdf:type schema:Person
    80 sg:pub.10.1007/978-1-4613-9883-7_4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1049814792
    81 https://doi.org/10.1007/978-1-4613-9883-7_4
    82 rdf:type schema:CreativeWork
    83 https://doi.org/10.1016/0004-3702(89)90046-5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1004978830
    84 rdf:type schema:CreativeWork
    85 https://doi.org/10.1093/comjnl/26.4.354 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009675711
    86 rdf:type schema:CreativeWork
    87 https://doi.org/10.1109/83.148613 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061239030
    88 rdf:type schema:CreativeWork
    89 https://doi.org/10.1145/602259.602266 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042817816
    90 rdf:type schema:CreativeWork
    91 https://doi.org/10.1207/s15516709cog0804_1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022286223
    92 rdf:type schema:CreativeWork
    93 https://doi.org/10.1613/jair.276 schema:sameAs https://app.dimensions.ai/details/publication/pub.1105538420
    94 rdf:type schema:CreativeWork
    95 https://www.grid.ac/institutes/grid.14003.36 schema:alternateName University of Wisconsin–Madison
    96 schema:name Computer Sciences Department, University of Wisconsin, 53706, Madison, WI, U.S.A.
    97 rdf:type schema:Organization
     




    Preview window. Press ESC to close (or click here)


    ...