Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

1998-09

AUTHORS

Zhexue Huang

ABSTRACT

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
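
The ideas named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the batch-style update loop (modes are recomputed once per pass, not after every single allocation), and the `gamma` weight parameter are assumptions made here for readability. It shows simple matching dissimilarity, a frequency-based mode update, and a combined measure of the kind k-prototypes relies on (squared Euclidean distance on numeric attributes plus a weighted matching term on categorical ones).

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Simple matching: number of attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))


def cluster_mode(objects):
    """Frequency-based mode: the most frequent category in each attribute column."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def mixed_dissimilarity(x_num, y_num, x_cat, y_cat, gamma=1.0):
    """Combined measure: squared Euclidean part plus gamma times the matching part.

    gamma is a hypothetical name for the weight balancing the two attribute types.
    """
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    return numeric + gamma * matching_dissimilarity(x_cat, y_cat)


def k_modes(objects, initial_modes, max_iter=100):
    """Batch k-modes sketch: assign each object to its nearest mode, then update modes."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in objects
        ]
        if new_labels == labels:  # no object changed cluster: converged
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [obj for obj, label in zip(objects, labels) if label == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = cluster_mode(members)
    return labels, modes


if __name__ == "__main__":
    data = [("a", "x"), ("a", "y"), ("a", "x"), ("b", "z"), ("b", "z"), ("c", "z")]
    print(k_modes(data, initial_modes=[("a", "x"), ("b", "z")]))
```

Each pass costs time proportional to the number of objects times the number of clusters times the number of attributes, which is the property that makes this family of algorithms attractive for large data sets.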

PAGES

283-304

Identifiers

URI

http://scigraph.springernature.com/pub.10.1023/a:1009769707641

DOI

http://dx.doi.org/10.1023/a:1009769707641

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1027035492



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, 2601, Canberra, ACT, Australia"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Huang", 
        "givenName": "Zhexue", 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf00114264", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000846536", 
          "https://doi.org/10.1007/bf00114264"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(91)90022-w", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1006308944"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0020-0255(73)90043-1", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008244991"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02293899", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008979876", 
          "https://doi.org/10.1007/bf02293899"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02294245", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012945757", 
          "https://doi.org/10.1007/bf02294245"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/s0019-9958(69)90591-9", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015649493"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf00114265", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017000685", 
          "https://doi.org/10.1007/bf00114265"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0167-8655(95)00075-r", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019728548"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1002/bs.3830120210", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1020292180"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(79)90034-7", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1034901103"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(87)90034-3", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1037415433"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02294153", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1040760134", 
          "https://doi.org/10.1007/bf02294153"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(80)90001-1", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046710985"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/233269.233324", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1053685676"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/21.97475", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061122466"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/34.9111", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061157242"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/tpami.1983.4767409", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061741968"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/tpami.1984.4767478", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061742013"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.2307/2528823", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1069974562"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.2307/2344237", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1103088593"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "1998-09", 
    "datePublishedReg": "1998-09-01", 
    "description": "The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1023/a:1009769707641", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": false, 
    "isPartOf": [
      {
        "id": "sg:journal.1041853", 
        "issn": [
          "1384-5810", 
          "1573-756X"
        ], 
        "name": "Data Mining and Knowledge Discovery", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "3", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "2"
      }
    ], 
    "name": "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values", 
    "pagination": "283-304", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "80ca94e1f3e994798ee64caf2a3bffd5f3bbfdf815fd7dd72a6727507d4e6122"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1023/a:1009769707641"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1027035492"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1023/a:1009769707641", 
      "https://app.dimensions.ai/details/publication/pub.1027035492"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-10T18:25", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8675_00000537.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "http://link.springer.com/10.1023%2FA%3A1009769707641"
  }
]
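
As a rough illustration of working with a record of this shape, the Python sketch below reads the `productId` and `citation` arrays. The helper names and the abbreviated `SAMPLE` string are assumptions made for this example; only the keys and values copied from the record above are real. Citation ids are de-duplicated in order, in case a record lists an entry more than once.

```python
import json

# Abbreviated, hypothetical sample that mirrors the structure of the record above.
SAMPLE = """
[
  {
    "id": "sg:pub.10.1023/a:1009769707641",
    "citation": [
      {"id": "sg:pub.10.1007/bf00114264", "type": "CreativeWork"},
      {"id": "https://doi.org/10.2307/2344237", "type": "CreativeWork"},
      {"id": "https://doi.org/10.2307/2344237", "type": "CreativeWork"}
    ],
    "productId": [
      {"name": "doi", "type": "PropertyValue", "value": ["10.1023/a:1009769707641"]},
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1027035492"]}
    ]
  }
]
"""


def product_ids(record):
    """Map each productId name (e.g. 'doi') to its first listed value."""
    return {item["name"]: item["value"][0] for item in record.get("productId", [])}


def unique_citation_ids(record):
    """Return citation ids in first-seen order, dropping repeated entries."""
    return list(dict.fromkeys(item["id"] for item in record.get("citation", [])))


if __name__ == "__main__":
    record = json.loads(SAMPLE)[0]  # the top level is a one-element array
    print(product_ids(record)["doi"])
    print(unique_citation_ids(record))
```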
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'
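
The same content negotiation can be sketched in Python with the standard library. This is an illustrative equivalent of the curl commands above, assuming the endpoint still responds to these `Accept` headers; the short format labels and the function names are choices made for this example, not part of any SciGraph API.

```python
from urllib.request import Request, urlopen

# Hypothetical short labels mapped to the media types used in the curl commands.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}

RECORD_URL = "https://scigraph.springernature.com/pub.10.1023/a:1009769707641"


def build_request(record_url, fmt):
    """Build a GET request whose Accept header selects the serialization."""
    return Request(record_url, headers={"Accept": ACCEPT[fmt]})


def fetch(record_url, fmt, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(build_request(record_url, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    request = build_request(RECORD_URL, "turtle")
    print(request.full_url, request.get_header("Accept"))
    # body = fetch(RECORD_URL, "turtle")  # uncomment to perform the actual request
```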


 

This table displays all metadata directly associated with this object as RDF triples.

124 TRIPLES      21 PREDICATES      47 URIs      19 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1023/a:1009769707641 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author Nbd893a4b1922471bae06371a224dc5fd
4 schema:citation sg:pub.10.1007/bf00114264
5 sg:pub.10.1007/bf00114265
6 sg:pub.10.1007/bf02293899
7 sg:pub.10.1007/bf02294153
8 sg:pub.10.1007/bf02294245
9 https://doi.org/10.1002/bs.3830120210
10 https://doi.org/10.1016/0020-0255(73)90043-1
11 https://doi.org/10.1016/0031-3203(79)90034-7
12 https://doi.org/10.1016/0031-3203(80)90001-1
13 https://doi.org/10.1016/0031-3203(87)90034-3
14 https://doi.org/10.1016/0031-3203(91)90022-w
15 https://doi.org/10.1016/0167-8655(95)00075-r
16 https://doi.org/10.1016/s0019-9958(69)90591-9
17 https://doi.org/10.1109/21.97475
18 https://doi.org/10.1109/34.9111
19 https://doi.org/10.1109/tpami.1983.4767409
20 https://doi.org/10.1109/tpami.1984.4767478
21 https://doi.org/10.1145/233269.233324
22 https://doi.org/10.2307/2344237
23 https://doi.org/10.2307/2528823
24 schema:datePublished 1998-09
25 schema:datePublishedReg 1998-09-01
26 schema:description The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
27 schema:genre research_article
28 schema:inLanguage en
29 schema:isAccessibleForFree false
30 schema:isPartOf N2cabfa846db44ad3a2ffe49efe7f4ad4
31 N3d917c5fa9484a25a484509de225d329
32 sg:journal.1041853
33 schema:name Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
34 schema:pagination 283-304
35 schema:productId N8ea8020226714d45bf3d5cd220c8a79e
36 Nf04da9cc17da450a89e59f5fac817830
37 Nf27028bc756a4dbba6ced00459bda1b0
38 schema:sameAs https://app.dimensions.ai/details/publication/pub.1027035492
39 https://doi.org/10.1023/a:1009769707641
40 schema:sdDatePublished 2019-04-10T18:25
41 schema:sdLicense https://scigraph.springernature.com/explorer/license/
42 schema:sdPublisher N2668f4338115484db5f0c31265d8d270
43 schema:url http://link.springer.com/10.1023%2FA%3A1009769707641
44 sgo:license sg:explorer/license/
45 sgo:sdDataset articles
46 rdf:type schema:ScholarlyArticle
47 N2668f4338115484db5f0c31265d8d270 schema:name Springer Nature - SN SciGraph project
48 rdf:type schema:Organization
49 N2ac7b85a3e914cc8924c246d0a84c760 schema:affiliation Nbfa00469b0384d7b83420e6bcadc0226
50 schema:familyName Huang
51 schema:givenName Zhexue
52 rdf:type schema:Person
53 N2cabfa846db44ad3a2ffe49efe7f4ad4 schema:issueNumber 3
54 rdf:type schema:PublicationIssue
55 N3d917c5fa9484a25a484509de225d329 schema:volumeNumber 2
56 rdf:type schema:PublicationVolume
57 N8ea8020226714d45bf3d5cd220c8a79e schema:name doi
58 schema:value 10.1023/a:1009769707641
59 rdf:type schema:PropertyValue
60 Nbd893a4b1922471bae06371a224dc5fd rdf:first N2ac7b85a3e914cc8924c246d0a84c760
61 rdf:rest rdf:nil
62 Nbfa00469b0384d7b83420e6bcadc0226 schema:name ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, 2601, Canberra, ACT, Australia
63 rdf:type schema:Organization
64 Nf04da9cc17da450a89e59f5fac817830 schema:name dimensions_id
65 schema:value pub.1027035492
66 rdf:type schema:PropertyValue
67 Nf27028bc756a4dbba6ced00459bda1b0 schema:name readcube_id
68 schema:value 80ca94e1f3e994798ee64caf2a3bffd5f3bbfdf815fd7dd72a6727507d4e6122
69 rdf:type schema:PropertyValue
70 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
71 schema:name Information and Computing Sciences
72 rdf:type schema:DefinedTerm
73 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
74 schema:name Artificial Intelligence and Image Processing
75 rdf:type schema:DefinedTerm
76 sg:journal.1041853 schema:issn 1384-5810
77 1573-756X
78 schema:name Data Mining and Knowledge Discovery
79 rdf:type schema:Periodical
80 sg:pub.10.1007/bf00114264 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000846536
81 https://doi.org/10.1007/bf00114264
82 rdf:type schema:CreativeWork
83 sg:pub.10.1007/bf00114265 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017000685
84 https://doi.org/10.1007/bf00114265
85 rdf:type schema:CreativeWork
86 sg:pub.10.1007/bf02293899 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008979876
87 https://doi.org/10.1007/bf02293899
88 rdf:type schema:CreativeWork
89 sg:pub.10.1007/bf02294153 schema:sameAs https://app.dimensions.ai/details/publication/pub.1040760134
90 https://doi.org/10.1007/bf02294153
91 rdf:type schema:CreativeWork
92 sg:pub.10.1007/bf02294245 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012945757
93 https://doi.org/10.1007/bf02294245
94 rdf:type schema:CreativeWork
95 https://doi.org/10.1002/bs.3830120210 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020292180
96 rdf:type schema:CreativeWork
97 https://doi.org/10.1016/0020-0255(73)90043-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008244991
98 rdf:type schema:CreativeWork
99 https://doi.org/10.1016/0031-3203(79)90034-7 schema:sameAs https://app.dimensions.ai/details/publication/pub.1034901103
100 rdf:type schema:CreativeWork
101 https://doi.org/10.1016/0031-3203(80)90001-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046710985
102 rdf:type schema:CreativeWork
103 https://doi.org/10.1016/0031-3203(87)90034-3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037415433
104 rdf:type schema:CreativeWork
105 https://doi.org/10.1016/0031-3203(91)90022-w schema:sameAs https://app.dimensions.ai/details/publication/pub.1006308944
106 rdf:type schema:CreativeWork
107 https://doi.org/10.1016/0167-8655(95)00075-r schema:sameAs https://app.dimensions.ai/details/publication/pub.1019728548
108 rdf:type schema:CreativeWork
109 https://doi.org/10.1016/s0019-9958(69)90591-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015649493
110 rdf:type schema:CreativeWork
111 https://doi.org/10.1109/21.97475 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061122466
112 rdf:type schema:CreativeWork
113 https://doi.org/10.1109/34.9111 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061157242
114 rdf:type schema:CreativeWork
115 https://doi.org/10.1109/tpami.1983.4767409 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061741968
116 rdf:type schema:CreativeWork
117 https://doi.org/10.1109/tpami.1984.4767478 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061742013
118 rdf:type schema:CreativeWork
119 https://doi.org/10.1145/233269.233324 schema:sameAs https://app.dimensions.ai/details/publication/pub.1053685676
120 rdf:type schema:CreativeWork
121 https://doi.org/10.2307/2344237 schema:sameAs https://app.dimensions.ai/details/publication/pub.1103088593
122 rdf:type schema:CreativeWork
123 https://doi.org/10.2307/2528823 schema:sameAs https://app.dimensions.ai/details/publication/pub.1069974562
124 rdf:type schema:CreativeWork
 



