K-Prototypes & Other Statistical Techniques to Cluster with Categorical and Numerical Features: A Deep Dive

Juan Esteban de la Calle
4 min read · Jul 19, 2023


Introduction

Clustering is one of the most fundamental and ubiquitous tasks in data science and machine learning. It’s a form of unsupervised learning that involves grouping similar objects together based on their attributes. Clustering algorithms find the underlying structure in your data, revealing patterns, relationships, and categories that might not have been visible otherwise.

While many clustering algorithms are available for different scenarios, not all of them are created equal. Some are great for numerical data, while others are designed for categorical data. When faced with a dataset that has a mix of numerical and categorical features, you may find yourself in a bit of a bind. K-means, for example, a wildly popular clustering algorithm, is not designed to handle categorical data. This is where K-Prototypes enters the fray, providing an effective solution for clustering datasets with mixed data types.

[Figure: clusters produced by K-Prototypes]

Mixed Data Types

Traditional clustering algorithms like K-means calculate the distance between data points to determine their similarity. This works fine for numerical attributes where the concept of ‘distance’ is well defined. For categorical attributes, however, this method fails as the idea of a ‘distance’ between categories is not meaningful. For instance, how would you calculate the distance between ‘blue’ and ‘green’, or ‘apple’ and ‘banana’?

Another problem is that numerical and categorical features often have different scales and units of measurement. Without proper normalization, an algorithm might unduly emphasize one feature type over the other, leading to biased clustering results.
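As a minimal sketch of that precaution (the DataFrame and column names here are hypothetical), you can standardize the numeric columns before clustering so that no single feature dominates the distance calculation:

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Hypothetical mixed-type data: two numeric columns and one categorical column
df = pd.DataFrame({
    'age': [25, 40, 33, 51],
    'income': [30000, 72000, 54000, 88000],
    'color': ['blue', 'green', 'blue', 'red'],
})

# Standardize only the numeric columns; the categorical column is left as-is
numeric_cols = ['age', 'income']
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])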

The K-Prototypes algorithm was specifically designed to overcome these issues, providing a robust method for clustering mixed data.

K-Prototypes: Bridging the Gap Between Numerical and Categorical Data

K-Prototypes, introduced by Huang in 1998, is an algorithm that gracefully handles a mixture of numerical and categorical data. The K-Prototypes algorithm does this by introducing a combined dissimilarity measure. For numerical attributes, the dissimilarity measure is the same as in K-means — the squared Euclidean distance. For categorical attributes, the measure is a simple matching dissimilarity measure, as in the K-modes algorithm.
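Concretely, the combined dissimilarity between a data point x and a cluster prototype q can be written (in slightly simplified notation) as

d(x, q) = Σ_numeric (x_j - q_j)² + γ · Σ_categorical δ(x_j, q_j)

where δ(a, b) equals 0 when the categories match and 1 when they do not, and γ (the gamma parameter in the code below) controls how heavily categorical mismatches weigh against numerical distances.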

A Python implementation of K-Prototypes is available in the kmodes library. You can use the KPrototypes class to create a K-Prototypes model and fit it to your data. Here's how you can use it, in this case to search over the number of clusters and the gamma parameter:

from kmodes.kprototypes import KPrototypes
from sklearn.metrics import silhouette_score
import numpy as np

# df is your pandas DataFrame of mixed data, and
# list_of_categorical_column_indices holds the positional indices of the
# categorical columns, e.g. [2, 5]
numerical_column_indices = [i for i in range(df.shape[1])
                            if i not in list_of_categorical_column_indices]

# Define the range of potential clusters and gamma values
clusters_range = range(2, 10)
gamma_range = np.linspace(0.1, 1, 10)  # gamma values between 0.1 and 1

# Placeholder variables
best_score = -1
best_clusters = None
best_gamma = None

for n_clusters in clusters_range:
    for gamma in gamma_range:
        kproto = KPrototypes(n_clusters=n_clusters, gamma=gamma, init='Huang')
        clusters = kproto.fit_predict(df, categorical=list_of_categorical_column_indices)

        # silhouette_score needs numeric input, so score on the numerical
        # columns only (a rough but convenient proxy for mixed data)
        score = silhouette_score(df.iloc[:, numerical_column_indices], clusters)

        # Check if this configuration beats the best score
        if score > best_score:
            best_score = score
            best_clusters = n_clusters
            best_gamma = gamma

print(f"Best score: {best_score}")
print(f"Optimal number of clusters: {best_clusters}")
print(f"Optimal gamma value: {best_gamma}")

Deep Dive into K-Prototypes

The K-Prototypes algorithm begins by randomly assigning each data point to a cluster. Then, for each cluster, it calculates the centroid (for numerical features) and mode (for categorical features). Next, each data point is reassigned to the cluster whose centroid/mode is closest to it, according to the combined dissimilarity measure. The centroid/mode calculations and data point assignments are repeated until the clusters no longer change or the maximum number of iterations is reached.

The real beauty of the K-Prototypes algorithm lies in its combined dissimilarity measure. For numerical features, the squared Euclidean distance provides a good measure of how similar two data points are, with smaller distances indicating higher similarity. For categorical features, the simple matching dissimilarity measure counts the number of categories that do not match between two data points. When combined, these two measures give a comprehensive indication of similarity across both numerical and categorical features.
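To make that concrete, here is a minimal NumPy sketch of the combined dissimilarity and the assignment step. This is not the kmodes library's internal code; the split into numeric and categorical arrays, the example values, and the gamma of 0.5 are assumptions made purely for illustration:

import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    # Squared Euclidean distance over the numerical attributes
    numeric_part = np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2)
    # Simple matching dissimilarity: count of categorical attributes that differ
    categorical_part = np.sum(np.asarray(x_cat) != np.asarray(proto_cat))
    return numeric_part + gamma * categorical_part

def assign_to_prototypes(X_num, X_cat, protos_num, protos_cat, gamma):
    # Assign each point to the prototype with the smallest combined dissimilarity
    labels = []
    for x_num, x_cat in zip(X_num, X_cat):
        dists = [mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
                 for p_num, p_cat in zip(protos_num, protos_cat)]
        labels.append(int(np.argmin(dists)))
    return np.array(labels)

# Example: two points, two prototypes, gamma = 0.5
X_num = np.array([[0.2, 1.0], [1.5, -0.3]])
X_cat = np.array([['blue', 'apple'], ['green', 'banana']])
protos_num = np.array([[0.0, 1.0], [1.4, 0.0]])
protos_cat = np.array([['blue', 'apple'], ['green', 'pear']])
print(assign_to_prototypes(X_num, X_cat, protos_num, protos_cat, gamma=0.5))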

Beyond K-Prototypes: Other Techniques for Mixed Data Clustering

While K-Prototypes is a powerful algorithm for clustering mixed data types, it’s not the only one. Here are a few other techniques you might consider:

1. Gower’s Distance: Rather than modifying the clustering algorithm itself, another approach is to change how we measure distance or dissimilarity between data points. Gower’s distance is one such measure that can handle mixed data types (a short sketch follows this list).

2. TwoStep Clustering: This method was developed by SPSS. It first creates many small sub-clusters and then clusters these sub-clusters into the desired number of clusters.

3. Latent Class Analysis (LCA): This is a probabilistic model that calculates the probability of class membership for each observation, based on their feature values. Observations are assigned to the class for which they have the highest membership probability.

4. DBSCAN with Gower’s distance: DBSCAN is a popular density-based clustering algorithm. By default, it uses Euclidean distance, but it can be modified to use Gower’s distance for mixed data.
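As an illustration of points 1 and 4 together, here is a hedged sketch that uses the third-party gower package (assuming it is installed via pip install gower) to build a Gower dissimilarity matrix and feeds it to scikit-learn's DBSCAN with a precomputed metric. The DataFrame contents and the eps value are arbitrary illustrative choices, not recommendations:

import gower  # third-party package: pip install gower
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical mixed-type data
df = pd.DataFrame({
    'age': [25, 40, 33, 51, 28],
    'income': [30000, 72000, 54000, 88000, 31000],
    'color': ['blue', 'green', 'blue', 'red', 'blue'],
})

# Pairwise Gower dissimilarities (handles numeric and categorical columns)
distance_matrix = gower.gower_matrix(df)

# DBSCAN on the precomputed dissimilarity matrix; eps is problem-dependent
dbscan = DBSCAN(eps=0.3, min_samples=2, metric='precomputed')
labels = dbscan.fit_predict(distance_matrix)
print(labels)  # -1 marks noise points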

Wrapping Up

In a world where data is king, the importance of robust data analysis methods cannot be overstated. Clustering is one of the most important tools in our data analysis toolbox, and with methods like K-Prototypes and the others mentioned here, we can tackle even complex datasets that mix numerical and categorical features.

The journey of data exploration is filled with opportunities to unearth valuable insights and tell compelling stories. Whether you’re exploring customer behavior, analyzing medical records, or studying social networks, knowing the right tool for the job can make all the difference. So, the next time you’re faced with a mixed data type dataset, remember — there’s a clustering algorithm for that.
