Prediction of credit card default: Predictive modelling (Part II)

Jacky Lim
13 min readNov 5, 2021

Before I explain the workflow for building the predictive models, readers are encouraged to at least browse through the first part of this article, which discusses some data visualization tools that facilitate understanding of the data.

Logistic regression (LR), a regularized model (lasso), a generalized additive model (GAM), and k-nearest neighbours (kNN) were implemented for this binary classification task. All these classifiers (except kNN)¹ readily output posterior class probabilities; an estimated probability of default is more valuable than a binary label from a risk management point of view.

¹ kNN estimates the class-conditional probability as the proportion of each class in the nearest-neighbour set. Refer to this website for the formula.
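For a query point x₀ with nearest-neighbour set 𝒩₀, the estimate is

$$\hat{P}(y = j \mid x_0) = \frac{1}{k} \sum_{i \in \mathcal{N}_0} I(y_i = j),$$

where I(·) is the indicator function.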

OK, now let's get straight to the point. First, load the required R packages. The code to import the spreadsheet into the R working environment and to split the data via stratified sampling was shown in the first part of this article, and will not be repeated here.

library(readxl) # load excel files
library(cdata) # data wrangling
library(dplyr) # tidy up the data (data wrangling)
library(rsample) # stratified sampling
library(caret) # preprocessing (min-max normalization)
library(sigr) # for AUC calculation
library(knitr) # for table presentation
library(mgcv) # GAM model
library(glue) # for formula construction
library(WVPlots) # double density plots and ROC curves
library(car) # VIF calculation
library(glmnet) # Lasso model
library(glmnetUtils) # Lasso model
library(class) # kNN

Data transformation

Continuous features

Min-max normalization was applied to the continuous features to standardize their scales; the resulting values lie in the interval [0, 1].
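For each continuous feature x, the transformation is

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}.$$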

# Select the continuous predictors
features_cont = colnames(train)[c(1, 5, 12:23)]
train_cont = train[, features_cont]
# Min-max normalization (caret is already loaded above)
preprocess_min_max = preProcess(train_cont, method = c("range"))
scaled_train_cont = predict(preprocess_min_max, train_cont)
# features_cat (names of the categorical predictors) was defined in part I
train_scaled = cbind(scaled_train_cont, train[, features_cat], default = train$default)

Categorical features

Most machine learning models cannot deal with categorical features directly. In other words, categorical features need to be transformed into numerical values prior to model training.

Let's take some time to review the underlying meaning of each value of the 'monthly repayment status' features. According to the Kaggle data description, -1 denotes pay duly, whereas positive integers denote the payment delay in months. Unfortunately, the underlying meanings of -2 and 0 are unknown (I could not find them in any documentation), so I treated -2 and 0 as two distinct categories. A mixture of one-hot encoding and ordinal encoding was used to encode this set of categorical features, as illustrated below.
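Concretely, each repayment-status value maps to four derived columns; the first three act as one-hot indicators and the fourth carries the ordinal delay (this mirrors the my_cat_encode function below):

  • -2 → (1, 0, 0, 0)
  • -1 (pay duly) → (0, 1, 0, 0)
  • 0 → (0, 0, 1, 0)
  • k ≥ 1 (delay of k months) → (0, 0, 0, k)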

The following figure and code snippet show how the categorical feature encoding is performed for the 'monthly repayment status' features.

‘Repayment status’ feature encoding.
m_train = dim(train_scaled)[1]
# Function to encode monthly payment status: one-hot encoding + ordinal encoding
my_cat_encode = function(predictor, ncol, col_names, m){
  # predictor is the feature (vector); ncol is the number of output columns
  # m is the number of samples; col_names holds the column headers
  mat = matrix(NA, nrow = m, ncol = ncol)
  # Custom encoding: one-hot encoding + ordinal encoding
  for (i in (1:m)){
    data = predictor[i]
    if (data == -2){
      vec = c(1, 0, 0, 0)
    } else if (data == -1){
      vec = c(0, 1, 0, 0)
    } else if (data == 0){
      vec = c(0, 0, 1, 0)
    } else {
      vec = c(0, 0, 0, data)
    }
    mat[i,] = vec
  }
  df_ps0 = data.frame(mat)
  colnames(df_ps0) = col_names
  df_ps0
}
df = data.frame(matrix(NA, nrow = m_train, ncol = 0))
f_pay = c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
for (j in f_pay){
  predictor = train_scaled[, j]
  col_names = sapply(c("neg2_{j}", "duly_{j}", "zero_{j}", "delay_months_{j}"), glue)
  names(col_names) = NULL
  df1 = my_cat_encode(predictor, ncol = 4, col_names = col_names, m = m_train)
  df = cbind(df, df1)
}
# Concatenate with the original dataframe and drop the raw PAY_* columns
train_new = cbind(train_scaled, df)
train_new = train_new[, !(colnames(train_new) %in% f_pay)]

Repeat the same data transformation pipeline on the test set.

# For test data
scaled_test_cont = predict(preprocess_min_max, test[, features_cont])
test_scaled = cbind(scaled_test_cont, test[, features_cat], default = test$default)
m_test = dim(test_scaled)[1]
# Repeat the same pipeline as for the training data
df = data.frame(matrix(NA, nrow = m_test, ncol = 0))
f_pay = c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
for (j in f_pay){
  predictor = test_scaled[, j]
  col_names = sapply(c("neg2_{j}", "duly_{j}", "zero_{j}", "delay_months_{j}"), glue)
  names(col_names) = NULL
  df1 = my_cat_encode(predictor, ncol = 4, col_names = col_names, m = m_test)
  df = cbind(df, df1)
}
test_new = cbind(test_scaled, df)
test_new = test_new[, !(colnames(test_new) %in% f_pay)]
cat('Training data no. of samples: ', dim(train_new)[1],
    ' ; no. of features: ', dim(train_new)[2])
cat('Test data no. of samples: ', dim(test_new)[1],
    ' ; no. of features: ', dim(test_new)[2])
Training set and test set sizes.

The transformed training set, train_new, is used to train the logistic regression (LR), lasso and generalized additive model (GAM) classifiers.

Classifiers

Recall that the continuous predictors in this dataset are positively skewed. In most situations, skewed predictors do not affect the regression outcome. Nonetheless, skewed independent variables can increase the likelihood of high-leverage points and of complete or quasi-complete separation (see this link), and can slightly degrade logistic regression performance (see this link). In view of that, a logarithmic transformation of the form ln(x+1) was applied to the continuous predictors.

variables = glue('log({features_cont}+1)')
response = 'default'
cat_var = setdiff(colnames(train_new), c(features_cont, response))
all_vars = c(variables, cat_var)
f = as.formula(
  paste(response, paste(all_vars, collapse = " + "), sep = " ~ ")
)
print(f)
Formula as argument for glm() function in R.

Logistic regression (LR)

LR is one of the most popular methods, and often the first go-to method among machine learning practitioners, for binary classification problems. The model outputs the posterior probability estimate of the positive class (y = 1). Additionally, each significant coefficient estimate quantifies the predictor's contribution to the log-odds (logit), which can be expressed mathematically as

$$\ln\!\left(\frac{P(\hat{y}=1\mid x)}{1 - P(\hat{y}=1\mid x)}\right) = \beta_0 + \sum_{j=1}^{n} \beta_j x_j,$$

where P(ŷ=1|x) is the estimated probability of the positive class (here, a default instance) given x, and n is the number of predictors. The optimal combination of coefficients β is found by minimizing the cross-entropy loss

$$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \ln P(\hat{y}_i = 1 \mid x_i) + (1 - y_i)\ln\big(1 - P(\hat{y}_i = 1 \mid x_i)\big) \Big],$$

where m is the number of training samples. After this brief introduction to the LR model, let's move on to the code. For the sake of simplicity, only the statistically significant coefficients are shown.

model_logreg = glm(f, data = train_new, family = binomial(link = "logit"))
summary(model_logreg)
# Show just the statistically significant coefficients (p <= 0.05)
coef_mat = summary(model_logreg)$coefficients
coef_sig = coef_mat[coef_mat[, 4] <= 0.05, ]
coef_sig
Statistically significant coefficients in the LR model.

Next, we need to tackle the collinearity problem mentioned in part I of this article, as it can mask the effects of some independent variables. Furthermore, collinearity reduces the accuracy of the coefficient estimates by inflating their standard errors. A correlation matrix can help pinpoint two correlated predictors, but collinearity can also exist among three or more variables (a situation called multicollinearity). Therefore, the variance inflation factor (VIF) is computed for each coefficient estimate. As a rule of thumb, a VIF of more than 5 indicates strong collinearity.
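For reference, the VIF of a single coefficient j is

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where R²ⱼ is the R² obtained from regressing predictor j on all the other predictors. For terms with more than one degree of freedom (such as factors), car::vif reports the generalized VIF (GVIF); the third column used in the code below is GVIF^(1/(2·Df)).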

VIF = vif(model_logreg)
kable(data.frame(VIF[VIF[, 3] > 5, ]), caption = "VIF for strongly correlated predictors")

Consistent with the Pearson correlation coefficients, the amounts of the bill statements in previous months display strong correlation. The other predictors that show high collinearity are the encoded payment statuses: all the encoded payment statuses for April, May and June 2005 are highly collinear. Put simply, there is a high degree of uncertainty in coefficient estimates with high VIF.

The following are some findings after fitting and interpreting the LR model:

  1. An increase in the amount of given credit, LIMIT_BAL, or in the previous payments (the last two months in particular), PAY_AMT1 and PAY_AMT2, lowers the likelihood of default.
  2. A single female with an 'others' or unknown educational background is less likely to default than a married male who finished graduate school, given that all other features are identical. This is in line with the findings discussed in part I.
  3. A longer delay in last month's repayment results in a higher chance of default, but the opposite holds for the delay duration of the August repayment.

To reiterate, these interpretations of predictor effects are only valid if all the other independent variables are held constant.

Next comes the model evaluation phase using the transformed test data.

pred = predict(model_logreg, newdata = train_new, type = "response")
pred_test = predict(model_logreg, newdata = test_new, type = "response")
test$pred_logreg = pred_test
test$binary_default = (test$default == 1)
(perf_train_logreg = performance(train_new$default, pred = pred))
(perf_test_logreg = performance(test_new$default, pred = pred_test))
Training and test performance metrics of LR model.

performance is a custom function that computes five binary classification performance metrics. The small discrepancies in the performance measures between the training and test sets suggest the absence of overfitting, or at worst mild overfitting.
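The full definition is in the linked code at the end of this article. A minimal sketch of what it might look like, assuming the five metrics are accuracy, precision, recall, F1 and AUC with a default score threshold of 0.5 (a hypothetical reconstruction, not the exact implementation):

# Hypothetical sketch of the performance() helper
# Assumes truth is a 0/1 label (or factor) and pred is a vector of scores
performance = function(truth, pred, threshold = 0.5) {
  truth = as.numeric(as.character(truth))   # coerce factor labels to numeric 0/1
  pred_class = as.numeric(pred > threshold) # hard labels at the chosen threshold
  tp = sum(pred_class == 1 & truth == 1)
  fp = sum(pred_class == 1 & truth == 0)
  fn = sum(pred_class == 0 & truth == 1)
  accuracy = mean(pred_class == truth)
  precision = tp / (tp + fp)
  recall = tp / (tp + fn)
  F1 = 2 * precision * recall / (precision + recall)
  AUC = calcAUC(pred, truth == 1)           # sigr's AUC on the raw scores
  c(accuracy = accuracy, precision = precision, recall = recall, F1 = F1, AUC = AUC)
}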

Lasso

To reduce the impact of collinearity, we can use a technique called regularization. The lasso is a penalized regression method that shrinks regression coefficients towards zero. In fact, coefficient estimates can be forced to be exactly zero by the L1 penalty, which is imposed by minimizing

$$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \ln P(\hat{y}_i = 1 \mid x_i) + (1 - y_i)\ln\big(1 - P(\hat{y}_i = 1 \mid x_i)\big) \Big] + \lambda \sum_{j=1}^{n} \lvert \beta_j \rvert.$$

“When variables are nearly collinear, lasso regression tends to drive one or more of them to zero.” (Source: Practical Data Science with R, 2nd edition, pg. 263, 2020.)

# alpha = 1 gives the lasso (L1) penalty; alpha = 0 would give ridge instead
model_lasso = cv.glmnet(f, data = train_new, alpha = 1,
                        family = "binomial", standardize = FALSE)
coeff = coef(model_lasso)
coef_frame <- data.frame(coef = rownames(coeff)[-1],
                         value = coeff[-1, 1])
ggplot(coef_frame, aes(x = coef, y = value)) +
  geom_pointrange(aes(ymin = 0, ymax = value)) +
  ggtitle("Coefficients of lasso model") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 6))

Bear in mind that we do not know which independent variables are statistically significant, so we cannot quantify the actual effect of each individual predictor on the target variable. However, we can still use the signs of the coefficient estimates as a hint about the direction (positive or negative) of the association between predictors and response. For instance, last month's payment delay duration, delay_months_PAY_0, is positively correlated with the log-odds of default. The performance of the lasso is as shown:

Training and test performance of lasso.
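For completeness, here is a sketch of how these numbers can be reproduced; glmnetUtils' formula-based cv.glmnet fit supports predict() with a newdata data frame, and performance() is the helper sketched earlier:

# Sketch of the lasso evaluation step
pred_lasso = as.numeric(predict(model_lasso, newdata = train_new, type = "response"))
pred_test_lasso = as.numeric(predict(model_lasso, newdata = test_new, type = "response"))
(perf_train_lasso = performance(train_new$default, pred = pred_lasso))
(perf_test_lasso = performance(test_new$default, pred = pred_test_lasso))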

GAMs

Generalized additive models (GAMs) model non-monotone responses within the framework of the logistic model while maintaining additivity:

$$\ln\!\left(\frac{P(y=1\mid x)}{1 - P(y=1\mid x)}\right) = \beta_0 + \sum_{j=1}^{n} f_j(x_j),$$

where fⱼ is a non-linear function (a smoothing spline) of predictor j. The additive assumption enables inspection of the effect of each independent variable on the response while holding all the other variables fixed, similar to logistic regression. Below are the code snippets to train the GAM and visualize some of the non-linear patterns (the effect of the basis functions on the log-odds).

# Construct the formula
variables = glue('s(log({features_cont}+1))')
f_add = train_new %>% select(tail(names(.), 24)) %>% colnames()
f_cat = c("SEX", "EDUCATION", "MARRIAGE")
all_vars = c(variables, c(f_add, f_cat))
# Change the response to a binary (Boolean) variable
train_new$binary_default = (train_new$default == 1)
f = as.formula(
  paste("binary_default", paste(all_vars, collapse = " + "), sep = " ~ ")
)
print(f)
gam_model = gam(f, data = train_new, family = binomial(link = "logit"))
gam_model$converged # check the convergence of the GAM fit
GAM formula as argument for the gam() function.

A smoothing spline is fitted on each of the continuous predictors.

# Show the summary and visualize some statistically significant variables
(sum_gam = summary(gam_model))
# Visualize the s() outputs
terms = predict(gam_model, type = "terms")
frame1 = as.data.frame(terms)
df10 = sum_gam$chi.sq
names(df10)
vars = names(sum_gam$chi.sq)[which(sum_gam$s.pv < 0.05)]
vars
# Clean up the term labels, e.g. "s(log(AGE + 1))" -> "AGE"
clean_vars = gsub('[()]', '', vars)
clean_vars = sub("slog", '', clean_vars)
clean_vars = gsub('\\+', '', clean_vars)
clean_vars = gsub('\\s', '', clean_vars)
clean_vars = substr(clean_vars, 1, nchar(clean_vars) - 1)
frame1_visual = frame1[, vars]
train_gam_visual = train[, clean_vars]
# Pivot to long format for faceted plotting
frame1_visual_long = unpivot_to_blocks(frame1_visual,
                                       nameForNewKeyColumn = "basis_function",
                                       nameForNewValueColumn = "basis_values",
                                       columnsToTakeFrom = vars)
train_gam_visual_long = unpivot_to_blocks(train_gam_visual,
                                          nameForNewKeyColumn = "predictors",
                                          nameForNewValueColumn = "values",
                                          columnsToTakeFrom = clean_vars)
dat_visual = cbind(frame1_visual_long, train_gam_visual_long)
dat_visual %>% ggplot(aes(x = values, y = basis_values)) +
  geom_smooth() +
  facet_wrap(~predictors, ncol = 3, scales = "free")

The (linear or non-linear) relationship between each significant predictor and the log-odds of default can be visualized through the above plots, which show the basis function mapping for each significant predictor. For example, the log-odds of default starts to drop rapidly (look at the vertical scale) after roughly NT$250,000 for the previous payment in August 2005, PAY_AMT2, holding the other predictors fixed.

The performance of GAMs is as shown:

Training and test performance of GAMs.
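The evaluation step is not shown in the snippet above; a sketch, assuming mgcv's predict.gam and the performance() helper sketched earlier:

# Sketch of the GAM evaluation step
pred_gam = as.numeric(predict(gam_model, newdata = train_new, type = "response"))
pred_test_gam = as.numeric(predict(gam_model, newdata = test_new, type = "response"))
(perf_train_gam = performance(train_new$default, pred = pred_gam))
(perf_test_gam = performance(test_new$default, pred = pred_test_gam))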

kNN

kNN is an instance-based, memory-based, lazy learner: no model has to be fitted. Given a previously unseen datum x₀, the k training points spatially nearest to x₀ are identified, and x₀ is classified by majority vote among these k neighbours. A kNN classifier has two important hyperparameters: the distance measure and the number of nearest neighbours, k.

For this exercise, I used the Euclidean distance measure, which is the default for the knn function in the class package. The optimal k was chosen by a train-test validation approach². By the way, we need to one-hot encode the SEX, EDUCATION and MARRIAGE variables, as kNN cannot handle categorical features.

² Generally, k-fold cross-validation would be a better alternative. However, this dataset is quite large, which ramps up the computational cost of kNN; that cost depends largely on the choice of k and the number of training samples, m. Please refer to this StackExchange post for more details.

dummy = dummyVars("~.", data = train_new[, f_cat])
cat_new = data.frame(predict(dummy, newdata = train_new[, f_cat]))
var_removed = c("binary_default", f_cat)
train_onehot = train_new[, !(colnames(train_new) %in% var_removed)]
train_onehot = cbind(cat_new, train_onehot)
# Test data
cat_new = data.frame(predict(dummy, newdata = test_new[, f_cat]))
test_onehot = test_new[, !(colnames(test_new) %in% var_removed)]
test_onehot = cbind(cat_new, test_onehot)
cat('Training data no. of samples (kNN): ', dim(train_onehot)[1],
    ' ; no. of features: ', dim(train_onehot)[2])
cat('Test data no. of samples (kNN): ', dim(test_onehot)[1],
    ' ; no. of features: ', dim(test_onehot)[2])
# Use a train-test validation approach
set.seed(3)
split = initial_split(train_onehot, prop = 0.8, strata = "default")
train_knn = training(split)
test_knn = testing(split)
target_train = train_knn$default == 1
target_test = as.numeric(test_knn$default) - 1
train_knn = train_knn[, !(colnames(train_knn) %in% c("default"))]
test_knn = test_knn[, !(colnames(test_knn) %in% c("default"))]
Training and test data sizes for kNN classifier.

As a rule of thumb, a good starting k is the square root of m, which is approximately 155 here (rounded to an odd integer); this gives a clue about the range of k to search. Since this dataset has a skewed class distribution, for which accuracy is potentially misleading, the optimal k is found by maximizing the area under the receiver operating characteristic curve (AUC).

k_range = seq(from = 115, to = 175, by = 2)
AUC_vec = rep(NA, length(k_range))
iter = 0
for (K in k_range) {
  iter = iter + 1
  knn_pred = knn(train_knn, test_knn, target_train, k = K)
  AUC_vec[iter] = calcAUC(ifelse(knn_pred == TRUE, 1, 0), target_test)
}
df_plot_k = data.frame(k = k_range, AUC = AUC_vec)
k_opt = df_plot_k[which.max(AUC_vec), 1]
df_plot_k %>%
  ggplot(aes(x = k, y = AUC)) +
  geom_line() +
  geom_vline(xintercept = k_opt, linetype = 2, color = 'red') +
  xlab('Number of nearest neighbors, k') +
  ylab('AUC')
The optimal k is 123.
knn_pred = knn(train_onehot, test_onehot, (train_onehot$default == 1),
               k = k_opt, prob = TRUE)
# attr 'prob' is the proportion of votes for the winning class;
# convert it to the probability of the positive (default) class
prop_vote = attr(knn_pred, 'prob')
prob_pos = ifelse(knn_pred == FALSE, 1 - prop_vote, prop_vote)
(perf_test_knn = performance(truth = test_new$default, pred = prob_pos))
Performance of kNN classifier.

Summary of findings

The performance of all the aforementioned classifiers is tabulated below:

ROC curves for the different classifiers.

The kNN classifier clearly outperforms the other classifiers on all performance metrics. Arguably, recall is one of the most vital measures to watch, as it indicates the classifier's ability to detect potential default cases. The good news is that we can adjust the classifier's score threshold to navigate the precision-recall trade-off.

Double density plot for LR model.
Enrichment recall plot for LR model.

The above plots (which can be produced for the other models as well, not just LR) are helpful for finding a threshold that yields the desired combination of recall and precision. What if we change the threshold to 0.2?
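A sketch of this re-evaluation on the test set, assuming the threshold argument of the hypothetical performance() helper sketched earlier:

# Re-evaluate the LR and kNN test predictions at a 0.2 score threshold
(perf_test_logreg_02 = performance(test_new$default, pred = pred_test, threshold = 0.2))
(perf_test_knn_02 = performance(test_new$default, pred = prob_pos, threshold = 0.2))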

The recall of every classifier increases, as we classify more instances as default. The recall of kNN even leaps from 62.95% to 96.46%, which means it can detect about 96% of incoming default cases! All the code can be found on GitHub and RPubs.

Lastly, I would like to answer the questions posted on this Kaggle page. Note that the following inferences are derived from the coefficient estimates (magnitudes, signs and p-values) output by the LR and GAM models.

  1. How does the probability of default payment vary across categories of the different demographic variables?
  • Gender: the probability of default payment is higher for males than for females.
  • Education background: the probability of default payment is higher for the graduate school category than for the 'others' and unknown categories.
  • Marriage: the probability of default payment is higher for married than for single clients.

2. Which variables are the strongest predictors of default payment?

  • The amounts of the previous payments in August and September 2005, PAY_AMT1 and PAY_AMT2.
  • The amount of given credit, LIMIT_BAL.
  • The payment delay in September 2005: the longer the delay, the higher the probability of default payment.

Perhaps one of the shortcomings of this exercise is that interactions between the independent variables have not been studied. For instance, we cannot claim that a married male with university-level education is more likely to default than a single female with an unknown education level. To visualize multiple categorical variables, interested readers can refer to this link.

Reference

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473–2480.
