Exploratory customer personality analysis I: Data visualization

11 min readMay 25, 2022

Customer segmentation is the division of entire customer population into a set of diverse subgroups, each of them consist of customers with similar needs and characteristics. Companies could then come up with specific marketing strategies for each distinct customer segment, thereby enhance customer satisfaction and loyalty.

I came acroos this dataset from Kaggle and I think it could be a good starting point for customer segmentation project. The description of the attributes can be found on the aforementioned Kaggle platform.

The first part of this article would be focusing on data visualization whereas the second part would be more oriented towards cluster analysis and customer ranking. I would be using R for this experiment. Let’s dive in.

Data import

First, load the required R library packages.

library(tidyverse)  # Multiple packages for data analysis
library(visdat)     # visualize missing values
library(mice)       # multiple imputation of missing values
library(knitr)      # for tidy table display
library(GGally)     # for scatter plot matrix
library(ggmosaic)   # mosaic plot
library(reshape2)   # data wrangling
library(ggforce)    # for parallet set plots

Next, load the downloaded data into R workspace.

# Set the working directory to where the data file is located
setwd("~/ml projects/customer segmentation")
cust_data = read.csv("marketing_campaign.csv", sep = "\t")

There are 29 columns (variables) and 2240 rows (observations) as shown in the output below. Each row corresponds to a customer.

cust_data %>% str()
summary(cust_data)

Statistical summaries of some of the attributes (the full output can be found on RPubs).

Based on output of the summary() function, there are a few interesting things to take note:

There are NA's (missing values) for Income feature.
Most of the numerical features portrayed positive skewness.
Feature like ID is probably not useful as it does not encode any information about a customer. I will keep it in the DataFrame variable but not gonna use it for any subsequent analysis.
Feature like Z_CostContact and Z_Revenue are not useful attributes as they have zero variance.
Some feature engineering is required for feature like Dt_Customer and as they are not stored in the correct data types. This will be performed in the subsection below. Year_Birth can be transformed into more intuitive attribute like age of customer.
Feature restructuring from character to factor for features such as Education and Marital Status.

Feature engineering

Removal of zero variance variables

cust_data_copy = cust_data
cust_data = subset(cust_data, select = -c(Z_CostContact, Z_Revenue))

Get rid of invalid observations

table(cust_data$Marital_Status)   # tabulate the frequency of each marital status category
kable(cust_data %>% filter(Marital_Status == "YOLO"), caption = "YOLO instances")
cust_data = cust_data %>% filter(Marital_Status != "YOLO")

Feature extraction I

Derive the minimum number of household members from attributes Marital_Status, Kidhome and Teenhome.
Compute the total number of accepted offers by summing up the Response and five of the AcceptedCmp features.

# Calculate total number of accepted campaign, minimum number of household members.

min_num_member = function(x) {
  if (x[1]=="Married" || x[1]=="Together"){
    np = 2
  } else {
    np = 1
  }
  return(as.numeric(x[2]) + as.numeric(x[3]) + np)
}
min_num_household = apply(cust_data %>% 
                            select(Marital_Status, Kidhome, Teenhome), 
          1, min_num_member)
cust_data$min_num_household = min_num_household

cust_data = cust_data %>% mutate(tot_AcceptedCmp = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response)

Restructuring of data types

# Change the date from string to date format
cust_data$Dt_Customer = as.Date(cust_data$Dt_Customer, 
                                     format = c("%d-%m-%Y"))

# Change the categorical features from character to factor
categorical_features = c("Education", "Marital_Status", "AcceptedCmp1",
                         "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4",
                         "AcceptedCmp5", "Complain", "Response")
for (i in 1:length(categorical_features)){
  cust_data[[categorical_features[i]]] = 
    as.factor(cust_data[[categorical_features[i]]])
}
# ordered categorical variables: Kidhome and Teenhome
cust_data$Kidhome = factor(cust_data$Kidhome, 
                           levels = seq(from = min(cust_data$Kidhome), 
                                        to = max(cust_data$Kidhome), 
                                        by = 1), ordered = TRUE)
cust_data$Teenhome = factor(cust_data$Teenhome, 
                           levels = seq(from = min(cust_data$Teenhome), 
                                        to = max(cust_data$Teenhome), 
                                        by = 1), ordered = TRUE)

Feature extraction II

Change the attribute Year_Birth to age and change the Dt_Customer to number of days a customer is enrolled with the company.

# Change Year_Birth to age
current_year = 2014
cust_data = cust_data %>% mutate(age = current_year - Year_Birth) %>% 
  subset(select = -c(Year_Birth))

# Change Dt_customer to days of enrollment. Assume that the current date is 1st July 2014
current_date = as.Date("2014-07-01")
cust_data = cust_data %>% 
  mutate(days_enroll = difftime(current_date, Dt_Customer, units = c("days")) %>% 
           as.numeric()) %>% 
  subset(select = -c(Dt_Customer))

Get rid of outliers

Remove equal to or more than 100 years old customers.
Discard instance with Income of more than 10 times the median of Income . Marketing department might need to devise exclusive marketing strategy on this customer.

summary(cust_data$age)
cust_data = cust_data %>% filter(age < 100)   # filter out age >=100# Unusually high Income 
cust_data %>% arrange(desc(Income)) %>% head(3) %>% kable()
cust_data = cust_data %>% filter(ID!=9432)   
# remove instance with ridiculously high income

Missing values imputation

There are primarily two ways to deal with missing values:

Complete case analysis (drop observation(s) with missing values).
Imputation.
- Single imputation.
- Multiple imputation.

Multiple imputation is generally superior to single imputation because the missing values are calculated multiple times with many different plausible values being pooled together to get the estimates. There are three types of missingness mechanisms: 1) missing completely at random (MCAR), 2) missing at random (MAR), 3) missing not at random (MNAR). More details on the types of missing data with intuitive explanation can be found here.

Generally, multiple imputation can often deals with MCAR and MAR, but not MNAR. If the missing data mechanism is MNAR, the missing pattern has to be modelled because the missing data do not depends on the observed data. For this experiment, Multiple Imputation by Chained Equations (MICE) is implemented to obtain the estimation of missing Income instances. For compact illustration of what is under the hood of MICE algorithm, please refer to this article.

# Write a utility function to check for missing values.
check_missing_values = function(cust_data){
  vec_missing = apply(cust_data, 2, function(x){ sum(any(is.na(x)))})
  idx_missing = which(as.vector(vec_missing)==1)
  if (length(idx_missing)>0){
    cat("The columns(features) with missing values: ", 
      colnames(cust_data)[idx_missing])
    
  } else {
    print("No missing values")
  }
  
}
check_missing_values(cust_data)

Another way to visualize the missing data is through vis_miss() function.

vis_miss(cust_data)

cat(" \nThe number of missing values for income: ", sum(is.na(cust_data$Income)))
idx_missing = which(is.na(cust_data$Income))

Perform multiple imputation (MICE implementation)

# MICE implementation
imp = mice(cust_data, maxit = 0, print = F)
pred = imp$predictorMatrix
pred[,"ID"] = 0
# Use the default value of 5 iterations
imp = mice(cust_data, seed = 10, predictorMatrix = pred, print = F)
stripplot(imp, Income~.imp, pch = 20, cex = 2)

One dimensional scatterplot for Income variable in each iteration. The red points represent the imputed missing values.

cust_data = complete(imp, 5)
kable(cust_data[idx_missing[1:5],] %>% select(ID:Income), caption = "Missing values imputed customer data")
# Show the first 5 instances whereby the missing values are imputed

As shown in the one dimensional scatterplot and the table above, the imputed annual household incomes hover below 10k¹.

Data visualization

Suppose you are one of the data scientist dealing with this dataset and your superior raised a few questions as below:

How are the Income attribute distributed among the customers?
What are the relationship between the numerical features? Is there any interesting data pattern or correlation worth noting, which can provide useful insight?
What are the relationship among the categorical features? (e.g. how the rate of Response (accepted the offer in the last campaign) varies with the presence of children and teenagers in the household?)
What are the relationship between categorical and numerical features? (e.g. what is the distribution of the amount spent on different merchandise under different educational background?)

All the listed questions above can be answered by leveraging appropriate set of data visualization tools.

Histogram / density plot

cust_data %>% ggplot(aes(x = Income)) + 
  geom_histogram(aes(y = stat(count)/sum(count)), binwidth = 10000, 
                 boundary = 0) +
  ylab("Relative frequency")

# density plot
cust_data %>% ggplot(aes(x = Income)) + geom_density(fill = "lightblue", alpha = 0.5) +
  geom_vline(aes(xintercept=mean(Income)), color = "blue", linetype = 2) + geom_vline(aes(xintercept=median(Income)), color = "red", linetype = 3) + xlim(0, 2e+05) + theme_classic() + ggtitle("density plot for customer annual household income")

Most of the customers (99.5 %) have income of equal or less than 100k. In fact, (90.91 %) of the customers have annual household income within 20k-90k. There is a heavy right tail for both the histogram and density plot, indicating customer with relatively high annual household income (There are 7 customers with Income more than 150k.

Correlogram / Correlation matrix

# Linear correlation and scatter plots matrix for amount spent on different merchandise
idx_cont_features = grepl("Mnt", colnames(cust_data))
cust_data[, idx_cont_features] %>% ggpairs(title = "correlogram")
# Linear correlation and scatter plots matrix for number of purchases.
idx_cont_features = grepl("Num", colnames(cust_data))
cust_data[, idx_cont_features] %>% ggpairs(title = "correlogram")

Scatter plot matrix + correlations for products-related features set.

Scatter plot matrix + correlations for “purchase behavior” features set.

Some useful information that we can quickly grasp from the the above plots are:

The amount spent on different pairs of product types are positively linear-correlated. (Note: the stars beside the correlations on the upper right matrix denote statistical significance. More info can be found in the GGally documentation.). This means that as the amount spent on a certain type of product increases, the amount spent on another type product increases as well².

2. Positive correlation between NumDealsPurchases and NumWebVisitsMonth . This may be attributed to the fact that customer get the incoming discount offer information by visiting the company website last month.

3. Negative correlation between NumWebVisitsMonth and NumStorePurchases . Customers who prefer to buy directly in stores visit the company website less frequently.

Lastly, the correlation matrix for all numerical attributes (including the derived features discussed in the feature engineering subsection) are as shown below:

seq = c("Mnt", "Num","Age", "days", "Income", "Recency", "household","tot")
a = sapply(seq, function(x) grep(x, colnames(cust_data)))

idx = c()
for (i in (1:length(seq))){
  idx = c(idx, a[[i]])
}

corr_mat = cor(cust_data[,idx])
corr_mat[upper.tri(corr_mat)] = NA
melted_cormat = melt(corr_mat, na.rm = TRUE)
# The dimension becomes nx3 matrix

melted_cormat %>% ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") + 
  scale_fill_gradientn(limit = c(-1,1), colors = hcl.colors(40, "Earth")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) + coord_fixed() +
  geom_text(aes(Var1, Var2, label = round(value, 2)), color = "black", size = 1.5)

The strong positive linear correlation (0.72) of MntMeatProducts and NumCataloguePurchases (row 13, column 9) may suggest that large amount of spending made through catalogue are on meat product, compared to other products.

Mosaic Plot / Parallel set plot

These visualization tools can be used to investigate the relationship among categorical features.

cust_data %>% ggplot() +
  geom_mosaic(aes(x = product(Marital_Status, Education), fill = Response)) +
  theme(axis.text.x = element_text(face = "bold", angle = 90),
        axis.text.y = element_text(face = "bold")) +
  ggtitle("Mosaic Plot of Response, Marital status and Education")cust_data %>% ggplot() +
  geom_mosaic(aes(x = product(Kidhome, Teenhome), fill = Response)) +
  ggtitle("Mosaic Plot of Response, Kidhome and Teenhome")

By empirical observation on the mosaic plots above, we know that 1) the Response rate for “Married” and “Together” is generally less than “Single”, “Divorced” and “Widow”. 2) The probability of accepting offer is relatively higher for customer with no child and teenagers. In other word, the ratio of the area of blue rectangle to the area of the whole cell (in the lower left corner of the mosaic plot above) represents the conditional probability of accepting the offer given that the customer does not have child and teenager in his / her household. The parallel set plots can be found in the html documents posted on RPubs.

Violin Plots / Grouped box-plots

seq = c("Income", "Education", "Marital", "min")
idx_plot = sapply(seq, function(x) grep(x, colnames(cust_data)))

idx = c()
for (i in (1:length(seq))){
  idx = c(idx, idx_plot[[i]])
}
cust_data %>% dplyr::select(idx) %>% 
  ggplot(aes(x = Education, y = Income, fill = Education)) +
  geom_violin(scale = "count", draw_quantiles = c(0.25, 0.5, 0.75)) + stat_summary(fun = mean, geom = "point", shape=21, size=1.5, color = "red") + ylim(0, 2e+05)

Distribution of Customer annual household income under different educational backgrounds.

Distribution of Customer annual household income under different marital statuses.

Distribution of Customer annual household income under different minimum number of household members (household size).

According to the above violin plots,

Income for “Basic” education level is generally lower in contrast to other education levels.
Similar income for all marital statuses.
Customer with smaller household size has higher annual household income.

Grouped box plots are used to visualize the distribution of amount spent on different types of products for customers who belongs to different educational backgrounds, marital statuses and household size.

seq = c("Mnt", "Education", "Marital", "min")
idx_plot = sapply(seq, function(x) grep(x, colnames(cust_data)))
# Alternative, unlist() function could be used to flatten the nested list
idx = c()
for (i in (1:length(seq))){
  idx = c(idx, idx_plot[[i]])
}
cust_data %>% dplyr::select(idx) %>% 
  pivot_longer(starts_with("Mnt"), names_to = "purchase_types",
               values_to = "amount_spent") %>% 
  ggplot(aes(x = Education, y = amount_spent)) +
  geom_boxplot(fill = "chocolate2", notch = TRUE) + stat_summary(fun = mean, geom = "point", shape=21, size=1.5, color = "red") + theme_classic() + facet_wrap(~purchase_types, ncol = 2, scales = "free_y")

Lets look at the bottom right (“MntWines”) of top grouped box plots. The notch of the PhD boxplot does not overlap with other notches, indicating that the median of amount spent on wine for customer with Phd academic qualification is statistically different from other customers of lower education level. The red dots which represent the mean for different groups also show a higher amount spent on wines for PhD customers. Based on the bottom grouped box plots, the amount spent is lower for larger household size across all product types.

Final remarks

Data visualization is a very powerful and useful technique to get quick rough understanding on data pattern and ultimately leads to better decision. I would like to wrap up by quoting one of the sentence from this online article:

A good visualization tells a story, removing the noise from data and highlighting the useful information.

Next, we will move on to the cluster analysis and customer ranking, which will be the focus of the second part of this article.

All codes for this article is available on Github (‘customer segmentation’ folder) and RPubs. Thanks for reading. Have a nice day!

[1]: The dataset description on the Kaggle post does not reveal the currency for the Income variable.

[2]: To be foolproof, we shall not extend this move-in-tandem linear relationship to more than 2 features. For example, as A and B increase, C also increases. Correlation measures are to quantify linear relationship between pairs of variables. If relationship of 3 variables is to be examined, 3D scatter plot is a good choice.

Mlearning.ai Submission Suggestions

How to become a writer on Mlearning.ai

medium.com