Customer personality analysis II: Cluster analysis and customer ranking

Jacky Lim
12 min read · May 27, 2022

Machine learning is a broad field of study that can be roughly divided into several subdomains, such as supervised learning, unsupervised learning and reinforcement learning. Most machine learning books devote the majority of their content to supervised learning, because it is a well-understood area. Supervised learning techniques usually come with a clear goal (an objective or cost function) as well as systematic ways to evaluate the performance of the trained models (e.g. accuracy, recall and precision). On the other hand, unsupervised learning refers to methods that reveal patterns in data without any target or response variable. Unsupervised learning tends to be more subjective, and is most of the time part of exploratory data analysis¹.

As one branch of unsupervised learning, cluster analysis (CA) (a.k.a. clustering or data segmentation) aims to discover hidden subgroups or segments in data, such that data points in the same subgroup (cluster) are more alike and more closely related to each other than to data points assigned to other clusters. The notion of similarity / dissimilarity / proximity between pairs of data points, which is an integral part of any CA algorithm, is typically measured by distance: the larger the distance between two instances, the lower their similarity. In our case, CA is employed to split the customers into several distinct, homogeneous groups, so that different marketing campaigns and advertisements can be tailored to each customer group.

After this brief discussion of the general concepts of clustering, let us see how we can make use of CA to perform customer segmentation. Lastly, a multi-criteria decision analysis method called Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is proposed to rank the customers within a specific cluster. Before jumping into the code, a kind reminder: the first part of this article can be found here.

Load library packages

library(tidyverse)
library(e1071) # for skewness calculation
require(cluster) # gower distance matrix and pam algorithm
library(fpc) # cluster evaluation metrics
library(mdw) # entropy-based feature weights
library(MCDA) # TOPSIS
library(FactoMineR) # Factor analysis for mixed data (FAMD)
library(factoextra) # Visualization of factor analysis
library(knitr) # For beautiful table display
library(car) # For interactive 3D scatter plot
library(kableExtra) # Custom html table styles

Import previously saved data

# set the working directory
setwd("~/ml projects/customer segmentation")
cust_data = readRDS("preprocessed_cust_data.rds")

Cluster Analysis (CA)

Dissimilarity matrix computation using Gower distance measure

Gower distance measures the dissimilarity between two data points with mixed numeric and categorical attributes. I am not going to reproduce the formula of Gower distance here, but a simple example of Gower distance calculation can be found here.

For mixed data types, the daisy() function from the cluster package can be used to calculate the pairwise dissimilarity between observations. The type argument of daisy() is used to specify the types of the variables in the input cust_data_without_ID. More information about daisy() can be obtained from the help() function, or by typing ?daisy in the console.
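Before applying daisy() to the customer data, here is a tiny toy example (hypothetical values and column names, unrelated to the customer dataset) showing how Gower dissimilarities combine mixed variable types:

# Toy example: one numeric and one nominal column
toy = data.frame(
  spend  = c(100, 400, 250),                     # numeric: contributes |x - y| / range(spend)
  region = factor(c("north", "south", "north"))  # nominal: contributes 0 if equal, 1 otherwise
)
daisy(toy, metric = "gower")
# Each pairwise dissimilarity is the average of the per-variable contributions.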

idx_col_retain = !(colnames(cust_data) %in% c("ID", "min_num_household", "tot_AcceptedCmp"))
cust_data_without_ID = cust_data[, idx_col_retain]

# Boolean (asymmetric binary) attributes
seq_binary = c("Complain", "Response", "Accepted")
idx_binary = sapply(seq_binary,
                    function(x) grep(x, colnames(cust_data_without_ID)))
idx_binary = unlist(idx_binary)

# Continuous attributes
cont_features_patterns = c("Mnt", "Num", "Income", "Recency", "age",
                           "days_enroll")
idx_cont = sapply(cont_features_patterns,
                  function(x) grep(x, colnames(cust_data_without_ID)))
idx_cont = unlist(idx_cont)

# Log-transform heavily skewed continuous attributes (|skewness| > 1)
skewness_col = apply(cust_data_without_ID[, idx_cont], 2, skewness)
idx_logtrans = idx_cont[which(abs(skewness_col) > 1)]

# Ordinal attributes (Kidhome and Teenhome match the "home" pattern)
dissimilarity_matrix = daisy(cust_data_without_ID, metric = "gower",
                             type = list(ordratio = grep("home", colnames(cust_data_without_ID)),
                                         asymm = idx_binary,
                                         logratio = idx_logtrans))
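As an optional sanity check (not part of the original workflow), the resulting dissimilarity object can be summarised; Gower dissimilarities should lie between 0 and 1:

# Optional: summary statistics of the pairwise Gower dissimilarities
summary(dissimilarity_matrix)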

K-medoid clustering (PAM algorithm)

There are various clustering approaches, such as partitioning-based, hierarchical-based, density-based and model-based.

Similar to the popular k-means clustering, k-medoid clustering is a partitional clustering method in which data points are separated into groups, so-called clusters. The hyperparameter k, the number of clusters, is chosen by the user. However, k-medoid clustering is generally more robust and less sensitive to outliers than k-means. Why? The answer is two-fold:

  • k-medoid chooses medoids (actual data points) as centroids (i.e. cluster centers), whereas k-means uses the mean of each cluster as its centroid. This is analogous to the relationship between the median and the mean: the median is more robust because its breakdown value is 0.5, whereas the breakdown value of the mean is 1/n. You can think of the breakdown value as the resistance to outliers; the higher it is, the more robust the statistical measure.
  • If we look at the cost functions of both clustering methods, shown below, we notice that the cost function of k-means is the within-cluster sum of squares, while the distance function d(x, m) for k-medoid is arbitrary. A squared L2-norm cost function is easily affected by outliers.
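For completeness, the two cost functions can be written in standard notation (C_j denotes the j-th cluster, \mu_j its mean and m_j its medoid):

J_{\text{k-means}} = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert_2^{2}, \qquad J_{\text{k-medoid}} = \sum_{j=1}^{K} \sum_{x \in C_j} d(x, m_j)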

In this study, the partitioning around medoids (PAM) algorithm² is chosen for k-medoid clustering. The algorithm is quite similar to the k-means algorithm: 1) Initialization: randomly select k data points as the medoids m; 2) Assign every non-medoid point o to its closest medoid; 3) For every pair of m and o, swap m and o if the swap reduces the cost function. Repeat steps 2 and 3 until the cost function stops decreasing.
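To make these steps concrete, here is an illustrative (and deliberately naive) sketch of the swap procedure on a precomputed distance matrix. naive_pam is a made-up name for this sketch; it is not the optimized algorithm implemented in cluster::pam, which is what the analysis below actually uses.

# Illustrative-only sketch of the PAM idea on a distance matrix D
# (rows/columns indexed by observations).
naive_pam = function(D, k, max_iter = 100) {
  n = nrow(D)
  medoids = sample(n, k)                               # 1) random initial medoids
  # total distance of every point to its closest medoid
  cost_of = function(med) sum(apply(D[, med, drop = FALSE], 1, min))
  cost = cost_of(medoids)
  for (iter in 1:max_iter) {
    improved = FALSE
    # 2)-3) for every medoid m and non-medoid o, keep the swap only if it lowers the cost
    for (j in 1:k) {
      for (o in setdiff(1:n, medoids)) {
        cand = medoids
        cand[j] = o
        cand_cost = cost_of(cand)
        if (cand_cost < cost) {
          medoids = cand
          cost = cand_cost
          improved = TRUE
        }
      }
    }
    if (!improved) break                               # stop when no swap reduces the cost
  }
  list(medoids = medoids,
       clustering = apply(D[, medoids, drop = FALSE], 1, which.min))
}
# Example call on the Gower dissimilarities computed above:
# naive_pam(as.matrix(dissimilarity_matrix), k = 3)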

Selection of optimal number of clusters, k

After deciding which clustering method to use, the next question is how to measure the "goodness" of the clusters generated for different values of k. The answer is cluster validation indices (either internal or external).

  1. Internal: Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (compactness) and low inter-cluster similarity (separation)³. Similarity is a numerical measure of how alike two data instances are, normally quantified by a distance measure (e.g. Euclidean distance): the larger the distance between two data points, the lower their similarity.
  2. External: A ground truth or gold standard must be available, and the evaluation measures how well the clustering results match the ground-truth labels.

Since the data does not come with any annotated labels, we are left with internal evaluation metrics. To select the best k, I will use internal cluster evaluation metrics available in the fpc package: within-cluster sum of squares (the elbow method), average silhouette width, the Calinski-Harabasz (CH) index and the Dunn index.

The best k can be chosen from the elbow of the plot of within-cluster sum of squares versus k (the point beyond which the sum of squares decreases only gradually). For the average silhouette width, CH index and Dunn index, we want to maximize the indices, so over the range of k values tried, the best k is the one for which these indices are largest. I will not reproduce all the formulas here; the key takeaway is that the average silhouette width, CH index and Dunn index are scalar values derived from the pairwise distances between data points. Interested readers who wish to dig deeper can find the formulas for these indices in this paper.
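As one example (the standard definition, reproduced here purely for reference), the silhouette width of an observation i is

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

where a(i) is the mean dissimilarity of i to the other members of its own cluster and b(i) is the smallest mean dissimilarity of i to the members of any other cluster. The average silhouette width reported by cluster.stats() is simply the mean of s(i) over all observations.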

# The number of clusters is assumed to be in the range of 2 to 8.
k_array = seq(from = 2, to = 8, by = 1)
cluster_eval_df = data.frame(matrix(NA, nrow = length(k_array), ncol = 4))
colnames(cluster_eval_df) = c("silhouette", "CH_index", "Dunn_index", "wc_SOS")
cluster_eval_df$k = k_array

for (i in (1:length(k_array))){
  set.seed(i + 100)
  # set diss to TRUE, and set the number of random starts to 10
  kmedoid = pam(dissimilarity_matrix, k = k_array[i], diss = TRUE,
                nstart = 10)
  clust_stat = cluster.stats(dissimilarity_matrix,
                             clustering = kmedoid$clustering)
  cluster_eval_df[i, "silhouette"] = clust_stat$avg.silwidth
  cluster_eval_df[i, "CH_index"] = clust_stat$ch
  cluster_eval_df[i, "Dunn_index"] = clust_stat$dunn
  # Add in the within-cluster sum of squares
  cluster_eval_df[i, "wc_SOS"] = clust_stat$within.cluster.ss
}

# Line plot for all evaluation metrics (internal cluster evaluation)
cluster_eval_df %>%
  pivot_longer(!k, names_to = "cluster_eval", values_to = "value") %>%
  ggplot(aes(x = k, y = value)) +
  geom_line() + geom_point() +
  facet_grid(rows = vars(cluster_eval), scales = "free_y")
Line plots of multiple cluster evaluation metrics for different numbers of clusters, k. The best k is 3, as suggested by the CH index and Dunn index. The possible number of clusters is assumed to be within 2 to 8.

We proceed by running the PAM algorithm with k set to 3.

# Find the best k using the mode function from Stack Overflow:
# https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode?page=1&tab=scoredesc#tab-top
mode = function(x){
  ux = unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# which.max() returns the row index in cluster_eval_df, so add 1 to convert it to k (k_array starts at 2)
best_k = mode(c(which.max(cluster_eval_df$silhouette),
                which.max(cluster_eval_df$CH_index),
                which.max(cluster_eval_df$Dunn_index))) + 1
set.seed(100 + best_k - 1) # for reproducibility
kmedoid = pam(dissimilarity_matrix, k = best_k, diss = TRUE, nstart = 10) # nstart = 10 for cluster stability
# Save the cluster labels in the data frame as a factor to facilitate data wrangling.
cust_data$cluster_label = as.factor(kmedoid$clustering)

Statistical summaries of features for different customer segments

  1. Number of customers in each cluster.
#1: Number of observations (customers) in each cluster
cust_data %>% group_by(cluster_label) %>%
  summarise(n = n()) %>%
  kable(caption = "Number of observations in each cluster")

2. Average of numerical features under each cluster

#2: Average of numerical features for each cluster
summary_cont_features_per_cluster = cust_data %>%
  group_by(cluster_label) %>%
  summarise_if(is.numeric, mean) %>%
  select(-ID)
summary_cont_features_per_cluster
Screenshot of R console output. The full table can be viewed on RPubs.

3. Distribution (proportion) of categorical features for each cluster.

#3: Distribution of categorical features for each cluster
cate_features_names = names(Filter(is.factor, cust_data))
# Change the second argument of group_by() and the fill aesthetic in aes() to the
# desired categorical feature name: Education, Marital_status, Kidhome, Teenhome.
cust_data %>% select(one_of(cate_features_names)) %>%
  group_by(cluster_label, Teenhome) %>% summarise(n = n()) %>%
  mutate(prop = (n/sum(n))) %>%
  ggplot(aes(x = cluster_label, y = prop, fill = Teenhome)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = round(prop, 2)), vjust = 1.5, size = 2,
            position = position_dodge(0.9)) +
  theme_minimal()

4. Distribution of derived attributes (e.g. minimum number of household members) per cluster.

cust_data$min_num_household = factor(cust_data$min_num_household, ordered = TRUE)
cust_data %>% group_by(cluster_label, min_num_household) %>% summarise(n = n()) %>%
  mutate(prop = (n/sum(n))) %>%
  ggplot(aes(x = cluster_label, y = prop, fill = min_num_household)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = round(prop, 2)), vjust = 1.5, size = 2,
            position = position_dodge(0.9)) +
  theme_minimal()

5. Tabulation of the binary (Boolean) variables for each cluster.

# Start with the proportions of the Complain outcomes per cluster
table1 = cust_data %>%
  group_by(cluster_label, Complain) %>% summarise(n = n()) %>%
  mutate(prop = (n/sum(n))) %>% select(-n)
colnames(table1)[2] = "Binary_outcomes"

# Campaign response / acceptance columns plus the cluster label
cust_data_1 = cust_data %>%
  select(contains("Response") | contains("Accept") | contains("cluster"))

# Join the proportions of each binary campaign variable to table1
n_col = ncol(cust_data_1)
for (i in (1:(n_col - 2))){
  table2 = cust_data_1 %>% select(c(n_col, all_of(i))) %>%
    group_by_all() %>%
    summarise(n = n()) %>%
    mutate(prop = (n/sum(n))) %>%
    select(-n)
  colnames(table2)[2] = "Binary_outcomes"
  table1 = table1 %>%
    left_join(table2, by = c("cluster_label" = "cluster_label",
                             "Binary_outcomes" = "Binary_outcomes"))
}

# Rename the proportion columns after the variables they summarize
oldname = colnames(table1)[3:length(colnames(table1))]
newname = c("Complain", colnames(cust_data_1)[1:(n_col - 2)])
for (i in (1:length(newname))){
  names(table1)[names(table1) == oldname[i]] = newname[i]
}
table1 %>% as_tibble()
Screenshot of R console output. The full table can be viewed on RPubs.

The characteristics of each cluster, inferred from the above statistical summaries, are tabulated below⁴:

Principal component analysis (PCA) for mixed data

Factor analysis of mixed data (FAMD), a PCA-like method that handles both numeric and categorical variables, is applied here primarily for dimensionality reduction so that we can visualize the data in a lower-dimensional space.

res.famd = FAMD(cust_data_without_ID, graph = F)

eig.value = get_eigenvalue(res.famd)
head(eig.value)
fviz_screeplot(res.famd)

# predict to get the transformed feature space and plot on 3 dimensional scatter plot
transformed_data = predict(res.famd, cust_data_without_ID)
library(rgl)
scatter3d(x = transformed_data$coord[,1], y = transformed_data$coord[,2],
          z = transformed_data$coord[,3],
          groups = cust_data$cluster_label, grid = FALSE,
          surface = FALSE, ellipsoid = TRUE,
          surface.col = c("#80FF00", "#009999", "#FF007F"))

A pop-up window will appear in RStudio if you run the above code chunk, and you can freely rotate the interactive 3D scatter plot to view it from different angles. If you open the rendered R Markdown HTML file, please make sure that WebGL is enabled in your browser. You can refer to this article on how to enable WebGL in different browsers.
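If WebGL is not available, a static 2D scatter plot of the first two FAMD dimensions is a simple fallback. This is an optional sketch, not part of the original analysis:

# Optional static fallback: first two FAMD dimensions, colored by cluster
as.data.frame(transformed_data$coord[, 1:2]) %>%
  setNames(c("Dim1", "Dim2")) %>%
  mutate(cluster = cust_data$cluster_label) %>%
  ggplot(aes(x = Dim1, y = Dim2, color = cluster)) +
  geom_point(alpha = 0.5) +
  theme_minimal()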

Bear in mind that the scatter plot (the data represented in the transformed dimensions) only captures roughly 30% of the variation of the original data. As such, it should be used with caution. Furthermore, there is a high degree of overlap among the 3 clusters.

Customer ranking for specific cluster (customer segment)

According to the Pareto principle (a.k.a. the 80–20 rule), one fifth (20%) of the customers contribute about 80% of the revenue of a company. As such, identifying the most valuable customers is crucial, so that the company can allocate more resources to fostering a close relationship with these customers.

Technique for order preference by similarity (TOPSIS)

TOPSIS is one of the multi-criteria decision analysis (MCDA) tools used to rank or choose alternatives (here, customers) based on multiple criteria (the chosen feature set). Intuitively speaking, alternatives that have a short distance from the positive ideal solution and a large distance from the negative ideal solution are awarded high scores. You can find a detailed description of the TOPSIS algorithm on this Wiki page, so I will skip the details here (a minimal sketch of the computation follows the bullet points below). The bottom lines when implementing TOPSIS are:

  • You can set the feature weights to prioritize certain attribute(s) over the others. If you are a domain expert who knows the importance of each feature, you can supply your own set of feature weights. In this experiment, the entropy weight method is used instead: the higher the degree of dispersion in a feature, the more information can be extracted from it, and hence the higher the weight assigned to that feature.
  • The output scores range from 0 to 1, with values close to 1 indicating the best alternatives.
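To make the procedure concrete, here is a minimal hand-rolled sketch of the TOPSIS steps, assuming all criteria are benefit criteria to be maximized and every criterion column has at least one non-zero entry. topsis_sketch is a made-up name; the actual ranking below uses the TOPSIS() function from the MCDA package.

# Sketch of TOPSIS: X is a numeric decision matrix (rows = alternatives,
# columns = criteria), w is a vector of feature weights.
topsis_sketch = function(X, w) {
  X = as.matrix(X)
  R = sweep(X, 2, sqrt(colSums(X^2)), "/")        # vector-normalize each criterion
  V = sweep(R, 2, w, "*")                         # apply the feature weights
  ideal_pos = apply(V, 2, max)                    # positive ideal solution
  ideal_neg = apply(V, 2, min)                    # negative ideal solution
  d_pos = sqrt(rowSums(sweep(V, 2, ideal_pos)^2)) # distance to the positive ideal
  d_neg = sqrt(rowSums(sweep(V, 2, ideal_neg)^2)) # distance to the negative ideal
  d_neg / (d_pos + d_neg)                         # closeness score in [0, 1]; 1 is best
}

Given a decision matrix and a weight vector, topsis_sketch() returns one closeness score per alternative, which is exactly what the TOPSIS scores computed below represent.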

Suppose that we are interested in ranking the customers in the first cluster:

# Calculate feature weights for continuous and categorical features for the first cluster
i = 1
data_analysis_cont = cust_data %>% filter(cluster_label == i) %>%
  dplyr::select(starts_with(c("Mnt", "Num")))
data_analysis_cate = cust_data %>% filter(cluster_label == i) %>%
  dplyr::select(starts_with(c("Accepted", "Response")))

h_cont = get.bw(scale(data_analysis_cont), bw = "nrd", nb = 'na')
w_cont = entropy.weight(scale(data_analysis_cont), h = h_cont)
w_cate = entropy.weight(data_analysis_cate, h = 'na')

The above code snippet might take some time to execute.

# Convert the binary factors to numeric so they can be used by TOPSIS
data_analysis_cate <- lapply(data_analysis_cate,
                             function(x) as.numeric(as.character(x)))
data_analysis = cbind(data_analysis_cont, data_analysis_cate)

feat_imp = c(w_cont, w_cate)
overall = TOPSIS(data_analysis,
                 feat_imp,
                 criteriaMinMax = rep("max", length(feat_imp)))
data_analysis$topsis_score = overall
data_analysis$ID = cust_data %>% filter(cluster_label == i) %>% select(ID)

# Top 10 customers for cluster 1
data_analysis %>% arrange(desc(topsis_score)) %>% as_tibble() %>%
  head(10) %>% relocate(ID, topsis_score)
Screenshots of R console. The full table can be found on RPubs.

Conclusion

Let's briefly walk through the workflow of this experiment:

  1. Data acquisition. Download the data and import it into R workspace.
  2. Data preprocessing.
    - Removal of zero variance variables.
    - Discard outliers / invalid observations.
    - Feature extraction.
    - Missing values imputation.
  3. Data visualization.
  4. Cluster analysis by k-medoid clustering.
  5. Customer ranking by TOPSIS.

There are some shortcomings of this customer segmentation approach that I would want to highlight:

  • This is not incremental learning. If we want to segment a new customer, we would need to run the clustering model on the whole dataset again (including the new datum).
  • The results might not be reproducible with other clustering methods. Cluster stability and consistency have always been inherent issues in CA. For this experiment, I employed a partition-based clustering method, but there is no guarantee that it is the best choice. A better approach would be to try other clustering methods that come with different assumptions and compare the clustering results.

It is worth noting that a customer segmentation model should be monitored and evaluated constantly (it is an iterative process) to ensure its effectiveness and robustness, as customers' behaviors can evolve over time.

Lastly, all the code in this article can be found on Github and RPubs. Thanks for reading!

[1]: James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). Introduction. In: An Introduction to Statistical Learning. Springer Texts in Statistics. Springer, New York, NY.

[2]: K-medoid problem is NP-hard and PAM algorithm is a heuristic solution.

[3]: Apart from conventional validation metrics that take into account the intra-cluster and inter-cluster similarity, cluster stability and prediction strength are some other viable options.

[4]: The characteristics of each cluster applies to majority of the cluster members, not all of them.
