Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

1998-09

AUTHORS

Zhexue Huang

ABSTRACT

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
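
The three ingredients the abstract names for k-modes (simple matching dissimilarity, modes in place of means, frequency-based mode updates) can be illustrated with a minimal Python sketch. This is not the paper's implementation: it is a batch simplification that reassigns all objects and then recomputes every mode, and the function names, the `initial_modes` argument, and the `gamma` weight in the mixed measure are illustrative assumptions. The mixed measure assumes squared Euclidean distance on the numeric attributes plus a weighted mismatch count on the categorical ones.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(objects):
    """Frequency-based mode: the most frequent category of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def k_modes(data, initial_modes, max_iter=20):
    """Batch k-modes sketch; initial_modes is supplied by the caller (assumption)."""
    modes = list(initial_modes)
    k = len(modes)
    labels = None
    for _ in range(max_iter):
        # Allocate every object to the cluster with the nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # no object changed cluster, so the partition is stable
        labels = new_labels
        # Recompute each mode from the current members of its cluster.
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:  # keep the old mode if a cluster is empty
                modes[j] = mode_of(members)
    return labels, modes


def mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Combined measure for mixed data: numeric part + gamma * categorical part."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    return numeric + gamma * matching_dissimilarity(x_cat, p_cat)
```

For example, `k_modes(data, initial_modes=[data[0], data[-1]])` seeds two clusters with the first and last objects; in practice the choice of initial modes affects the result, just as the choice of initial means does for k-means.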

PAGES

283-304

Identifiers

URI

http://scigraph.springernature.com/pub.10.1023/a:1009769707641

DOI

http://dx.doi.org/10.1023/a:1009769707641

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1027035492



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "name": [
            "ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, 2601, Canberra, ACT, Australia"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Huang", 
        "givenName": "Zhexue", 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf00114264", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000846536", 
          "https://doi.org/10.1007/bf00114264"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(91)90022-w", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1006308944"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0020-0255(73)90043-1", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008244991"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02293899", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1008979876", 
          "https://doi.org/10.1007/bf02293899"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02294245", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012945757", 
          "https://doi.org/10.1007/bf02294245"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/s0019-9958(69)90591-9", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015649493"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf00114265", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1017000685", 
          "https://doi.org/10.1007/bf00114265"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0167-8655(95)00075-r", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1019728548"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1002/bs.3830120210", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1020292180"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(79)90034-7", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1034901103"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(87)90034-3", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1037415433"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/bf02294153", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1040760134", 
          "https://doi.org/10.1007/bf02294153"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1016/0031-3203(80)90001-1", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046710985"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/233269.233324", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1053685676"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/21.97475", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061122466"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/34.9111", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061157242"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/tpami.1983.4767409", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061741968"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/tpami.1984.4767478", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1061742013"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.2307/2528823", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1069974562"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.2307/2344237", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1103088593"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "1998-09", 
    "datePublishedReg": "1998-09-01", 
    "description": "The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1023/a:1009769707641", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": false, 
    "isPartOf": [
      {
        "id": "sg:journal.1041853", 
        "issn": [
          "1384-5810", 
          "1573-756X"
        ], 
        "name": "Data Mining and Knowledge Discovery", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "3", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "2"
      }
    ], 
    "name": "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values", 
    "pagination": "283-304", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "80ca94e1f3e994798ee64caf2a3bffd5f3bbfdf815fd7dd72a6727507d4e6122"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1023/a:1009769707641"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1027035492"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1023/a:1009769707641", 
      "https://app.dimensions.ai/details/publication/pub.1027035492"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-10T18:25", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8675_00000537.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "http://link.springer.com/10.1023%2FA%3A1009769707641"
  }
]
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1009769707641'
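
The same content negotiation can be sketched in Python with the standard library. This is a minimal sketch mirroring the JSON-LD curl command above, assuming the endpoint is still reachable and still honours the Accept header; `build_request` and `fetch_record` are hypothetical helper names, not part of any SciGraph client.

```python
import json
import urllib.request

# Record URL copied from the curl examples above.
RECORD_URL = "https://scigraph.springernature.com/pub.10.1023/a:1009769707641"


def build_request(url=RECORD_URL, media_type="application/ld+json"):
    """Build a GET request whose Accept header selects the RDF serialisation."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_record(url=RECORD_URL):
    """Fetch the record as JSON-LD and parse it (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return json.load(response)
```

Passing `media_type="text/turtle"`, `"application/n-triples"`, or `"application/rdf+xml"` to `build_request` corresponds to the other three curl commands; those responses are not JSON, so they would need to be read as text rather than passed to `json.load`.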


 

This table displays all metadata directly associated with this object as RDF triples.

124 TRIPLES      21 PREDICATES      47 URIs      19 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1023/a:1009769707641 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author N5dcdd3f7e8f949bb8b267cb56b9abcd8
4 schema:citation sg:pub.10.1007/bf00114264
5 sg:pub.10.1007/bf00114265
6 sg:pub.10.1007/bf02293899
7 sg:pub.10.1007/bf02294153
8 sg:pub.10.1007/bf02294245
9 https://doi.org/10.1002/bs.3830120210
10 https://doi.org/10.1016/0020-0255(73)90043-1
11 https://doi.org/10.1016/0031-3203(79)90034-7
12 https://doi.org/10.1016/0031-3203(80)90001-1
13 https://doi.org/10.1016/0031-3203(87)90034-3
14 https://doi.org/10.1016/0031-3203(91)90022-w
15 https://doi.org/10.1016/0167-8655(95)00075-r
16 https://doi.org/10.1016/s0019-9958(69)90591-9
17 https://doi.org/10.1109/21.97475
18 https://doi.org/10.1109/34.9111
19 https://doi.org/10.1109/tpami.1983.4767409
20 https://doi.org/10.1109/tpami.1984.4767478
21 https://doi.org/10.1145/233269.233324
22 https://doi.org/10.2307/2344237
23 https://doi.org/10.2307/2528823
24 schema:datePublished 1998-09
25 schema:datePublishedReg 1998-09-01
26 schema:description The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
27 schema:genre research_article
28 schema:inLanguage en
29 schema:isAccessibleForFree false
30 schema:isPartOf Nc1e0f23920f7497fb18711e4bcdbe666
31 Ncd543f5e15c6408ea19e2ad43a39f70d
32 sg:journal.1041853
33 schema:name Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
34 schema:pagination 283-304
35 schema:productId N28ba08f2a5904214ab713c41609d3dcc
36 N4b952fc92b8443d4b0bccdaa66c7c769
37 N5bbc069ce1b74f20aa63577440e9dde4
38 schema:sameAs https://app.dimensions.ai/details/publication/pub.1027035492
39 https://doi.org/10.1023/a:1009769707641
40 schema:sdDatePublished 2019-04-10T18:25
41 schema:sdLicense https://scigraph.springernature.com/explorer/license/
42 schema:sdPublisher N851b2fa988fa4ab59414a42f0a410077
43 schema:url http://link.springer.com/10.1023%2FA%3A1009769707641
44 sgo:license sg:explorer/license/
45 sgo:sdDataset articles
46 rdf:type schema:ScholarlyArticle
47 N28ba08f2a5904214ab713c41609d3dcc schema:name doi
48 schema:value 10.1023/a:1009769707641
49 rdf:type schema:PropertyValue
50 N4b952fc92b8443d4b0bccdaa66c7c769 schema:name readcube_id
51 schema:value 80ca94e1f3e994798ee64caf2a3bffd5f3bbfdf815fd7dd72a6727507d4e6122
52 rdf:type schema:PropertyValue
53 N5bbc069ce1b74f20aa63577440e9dde4 schema:name dimensions_id
54 schema:value pub.1027035492
55 rdf:type schema:PropertyValue
56 N5dcdd3f7e8f949bb8b267cb56b9abcd8 rdf:first Ne341c607797642dd9843082039a53d08
57 rdf:rest rdf:nil
58 N851b2fa988fa4ab59414a42f0a410077 schema:name Springer Nature - SN SciGraph project
59 rdf:type schema:Organization
60 Na744eb1485834706a4495b7dce83ac3b schema:name ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, 2601, Canberra, ACT, Australia
61 rdf:type schema:Organization
62 Nc1e0f23920f7497fb18711e4bcdbe666 schema:issueNumber 3
63 rdf:type schema:PublicationIssue
64 Ncd543f5e15c6408ea19e2ad43a39f70d schema:volumeNumber 2
65 rdf:type schema:PublicationVolume
66 Ne341c607797642dd9843082039a53d08 schema:affiliation Na744eb1485834706a4495b7dce83ac3b
67 schema:familyName Huang
68 schema:givenName Zhexue
69 rdf:type schema:Person
70 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
71 schema:name Information and Computing Sciences
72 rdf:type schema:DefinedTerm
73 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
74 schema:name Artificial Intelligence and Image Processing
75 rdf:type schema:DefinedTerm
76 sg:journal.1041853 schema:issn 1384-5810
77 1573-756X
78 schema:name Data Mining and Knowledge Discovery
79 rdf:type schema:Periodical
80 sg:pub.10.1007/bf00114264 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000846536
81 https://doi.org/10.1007/bf00114264
82 rdf:type schema:CreativeWork
83 sg:pub.10.1007/bf00114265 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017000685
84 https://doi.org/10.1007/bf00114265
85 rdf:type schema:CreativeWork
86 sg:pub.10.1007/bf02293899 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008979876
87 https://doi.org/10.1007/bf02293899
88 rdf:type schema:CreativeWork
89 sg:pub.10.1007/bf02294153 schema:sameAs https://app.dimensions.ai/details/publication/pub.1040760134
90 https://doi.org/10.1007/bf02294153
91 rdf:type schema:CreativeWork
92 sg:pub.10.1007/bf02294245 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012945757
93 https://doi.org/10.1007/bf02294245
94 rdf:type schema:CreativeWork
95 https://doi.org/10.1002/bs.3830120210 schema:sameAs https://app.dimensions.ai/details/publication/pub.1020292180
96 rdf:type schema:CreativeWork
97 https://doi.org/10.1016/0020-0255(73)90043-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008244991
98 rdf:type schema:CreativeWork
99 https://doi.org/10.1016/0031-3203(79)90034-7 schema:sameAs https://app.dimensions.ai/details/publication/pub.1034901103
100 rdf:type schema:CreativeWork
101 https://doi.org/10.1016/0031-3203(80)90001-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046710985
102 rdf:type schema:CreativeWork
103 https://doi.org/10.1016/0031-3203(87)90034-3 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037415433
104 rdf:type schema:CreativeWork
105 https://doi.org/10.1016/0031-3203(91)90022-w schema:sameAs https://app.dimensions.ai/details/publication/pub.1006308944
106 rdf:type schema:CreativeWork
107 https://doi.org/10.1016/0167-8655(95)00075-r schema:sameAs https://app.dimensions.ai/details/publication/pub.1019728548
108 rdf:type schema:CreativeWork
109 https://doi.org/10.1016/s0019-9958(69)90591-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015649493
110 rdf:type schema:CreativeWork
111 https://doi.org/10.1109/21.97475 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061122466
112 rdf:type schema:CreativeWork
113 https://doi.org/10.1109/34.9111 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061157242
114 rdf:type schema:CreativeWork
115 https://doi.org/10.1109/tpami.1983.4767409 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061741968
116 rdf:type schema:CreativeWork
117 https://doi.org/10.1109/tpami.1984.4767478 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061742013
118 rdf:type schema:CreativeWork
119 https://doi.org/10.1145/233269.233324 schema:sameAs https://app.dimensions.ai/details/publication/pub.1053685676
120 rdf:type schema:CreativeWork
121 https://doi.org/10.2307/2344237 schema:sameAs https://app.dimensions.ai/details/publication/pub.1103088593
122 rdf:type schema:CreativeWork
123 https://doi.org/10.2307/2528823 schema:sameAs https://app.dimensions.ai/details/publication/pub.1069974562
124 rdf:type schema:CreativeWork
 



