Decision tree, random forest and gradient boosting: step-by-step example

Jacky Lim
7 min read · Jun 15, 2021

Decision tree is a popular supervised machine learning model for both classification and regression. In this post, I will focus on classification and continue to use the Pima Indians Diabetes dataset (as demonstrated in previous posts). Simply speaking, a decision tree works by partitioning the predictor space into rectangular regions. So how does a decision tree choose its splits? Intuitively, the misclassification rate seems like a good criterion, but it turns out to be less sensitive for tree growing than the Gini index and entropy.

Up to this point, perhaps some of you may have come across two different formulas for the Gini index, as shown below:

G = Σₖ pₖ(1 − pₖ)   and   G = 1 − Σₖ pₖ²

where pₖ is the proportion of samples from the kᵗʰ class. Let's work out the math to see if both give the same result in the end. Assuming K=2, so p₂ = 1 − p₁ at a specific region (node), the first formula gives p₁(1 − p₁) + (1 − p₁)p₁ = 2p₁(1 − p₁), while the second gives 1 − p₁² − (1 − p₁)² = 2p₁(1 − p₁).

Both equations are actually equivalent, and I personally think the second one is more compact. It is also pretty obvious that if each pₖ approaches zero or one, the Gini index will be low, which indicates high node purity.

With all that said, a decision tree is built in a greedy, top-down fashion: at each step, the split that gives the greatest reduction in Gini index is chosen, rather than a split that might lead to a better tree in future steps. A toy illustration of this splitting criterion is sketched below; after that, let's move on to the example.
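The following sketch is my own toy example (the vectors x and y are made up and not part of the diabetes analysis); it checks numerically that the two Gini formulas agree and then scans the candidate thresholds of a single predictor for the split with the largest Gini reduction.

# two ways of computing the Gini index from a vector of class proportions
gini1=function(p) sum(p*(1-p))   # first formula
gini2=function(p) 1-sum(p^2)     # second (compact) formula
node_gini=function(y) gini2(prop.table(table(y)))
x=c(1,2,3,4,5,6,7,8)             # a single numeric predictor (toy data)
y=c(0,0,0,1,0,1,1,1)             # binary class labels (toy data)
all.equal(gini1(prop.table(table(y))),gini2(prop.table(table(y)))) # TRUE: both formulas agree
# greedy split: try midpoints between sorted x values and keep the one
# with the largest drop in the weighted Gini index of the child nodes
parent=node_gini(y)
thresholds=head(sort(x),-1)+diff(sort(x))/2
reduction=sapply(thresholds,function(t){
  left=y[x<t]; right=y[x>=t]
  child=(length(left)*node_gini(left)+length(right)*node_gini(right))/length(y)
  parent-child
})
thresholds[which.max(reduction)] # chosen split point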

First, load the required library packages.

library(rpart) # decision tree model
library(rpart.plot) # decision tree diagram
library(wrapr) # formula
library(randomForest) # random forest model
library(GA) # genetic algorithm for hyperparameter tuning
library(rsample) # stratified sampling
library(xgboost) # gradient boosting
library(ggplot2) # beautiful plot
library(sigr) # AUC calculation

Next, load the data and split it into training and test data.

data=read.csv('diabetes.csv')
vars=setdiff(colnames(data),"Outcome")
set.seed(123)
randn=runif(nrow(data))
train_idx=randn<=0.8
train=data[train_idx,]
test=data[!train_idx,]

Train the decision tree model and plot the trained decision tree diagram.

# formula
data_formula=mk_formula("Outcome",vars)
treemodel=rpart(data_formula,train,method = "class") # model training
# Plot
rpart.plot(treemodel,type=5,extra=2,cex=0.65)

The interpretation is quite straightforward. Let's follow the leftmost path of the decision tree diagram: if Glucose < 102, then class = 0 (non-diabetic). The 173/185 tells us that 173 out of the 185 training samples in that terminal node belong to class 0 (non-diabetic). Each path of the decision tree can be summarized by an "if-then" rule statement.
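If you prefer the rules in text form, the rpart.plot package also provides the rpart.rules() function, which prints one "if-then" rule per terminal node of the fitted tree:

# print each root-to-leaf path of the trained tree as an "if-then" rule
rpart.rules(treemodel)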

# function for model performance
performance=function(y,pred){
  confmat_test=table(truth=y,predict=pred>0.5)        # confusion matrix at a 0.5 threshold
  acc=sum(diag(confmat_test))/sum(confmat_test)       # accuracy
  precision=confmat_test[2,2]/sum(confmat_test[,2])   # precision for the positive (diabetic) class
  recall=confmat_test[2,2]/sum(confmat_test[2,])      # recall for the positive (diabetic) class
  auc=calcAUC(pred,y)                                 # AUC via the sigr package
  c(acc,precision,recall,auc)
}
pred=predict(treemodel,newdata = train)[,2]
perf_train=performance(train$Outcome,pred)
pred=predict(treemodel,newdata=test)[,2]
perf_test=performance(test$Outcome,pred)
perf=rbind(perf_train,perf_test)
colnames(perf)=c("accuracy","precision","recall","AUC")
perf

The decision tree model suffers from overfitting (failure to generalize to new, unseen data), as shown by the discrepancies between the training and test performance metrics. Fortunately, there are extensions of the decision tree model, such as random forest and gradient boosting, which can mitigate this high-variance issue.

Random Forest

Random forest involves building a large number of decision trees by applying two important strategies: 1) bootstrap aggregation (bagging) and 2) decorrelating the trees by considering only a limited subset of predictors at each split. I will not delve deep into the two strategies here. Interested readers can refer to this excellent textbook.

For the implementation of a random forest model using the R package 'randomForest', there are three meta-parameters (also called hyperparameters) that we should take note of: 1) the number of trees, B, 2) the maximum number of terminal nodes, J, and 3) the minimum size of terminal nodes, N.

using a very large value of B will not lead to overfitting. In practice, we use a value of B sufficiently large that the error has settled down. (pg. 321)

— “An introduction to statistical learning” by James et al.

Thus, we can set B to a high value and determine appropriate values of J and N. A genetic algorithm (GA) is employed to choose the optimal combination of J and N.
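Before tuning, it is worth checking visually that B = 1000 is indeed "sufficiently large". A minimal sketch (the object name rf_check is mine): plotting a fitted randomForest object shows the out-of-bag error against the number of trees, and the curve should flatten well before 1000.

# OOB error versus number of trees; the curve levels off once B is large enough
rf_check=randomForest(train[,vars],y=as.factor(train$Outcome),ntree=1000)
plot(rf_check) # black line is the OOB error rate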

# function for random forest hyperparameter optimization
# x[1] refers to maxnodes, x[2] refers to nodesize
rf_param_opt=function(x) {
  # further split the training dataset
  split=initial_split(train,prop = 0.8,strata = "Outcome")
  train_split=training(split)
  test_split=testing(split)
  # Train on train_split with the candidate combination of x[1] and x[2],
  # then evaluate on test_split. Set ntree to a large number; x[1] and x[2]
  # must be integers, hence floor().
  model_rf=randomForest(train_split[,vars],y=as.factor(train_split$Outcome),ntree=1000,
                        maxnodes = floor(x[1]),nodesize=floor(x[2]))
  pred=predict(model_rf,test_split[,vars])
  # fitness: accuracy on the held-out split
  mean(test_split$Outcome==pred)
}
# GA parameter optimization
GA=ga(type="real-valued",fitness = rf_param_opt, lower = c(2,1),upper = c(50,10),
      popSize = 10, maxiter = 100,run=20,seed=1)
summary(GA)
plot(GA)
Output of GA.
Changes in fitness values for each iteration.

From the output of the GA, the best J is 26, while the best N is 5. After obtaining the optimal hyperparameter combination, train the model and evaluate its performance.

model_rf=randomForest(train[,vars],y=as.factor(train$Outcome),ntree=1000,importance = TRUE,
                      maxnodes = floor(GA@solution[1]),nodesize = floor(GA@solution[2]))
pred=predict(model_rf,train[,vars],type="prob")[,2]
perf_train=performance(train$Outcome,pred)
pred=predict(model_rf,test[,vars],type="prob")[,2]
perf_test=performance(test$Outcome,pred)
perf_rf=rbind(perf_train,perf_test)
colnames(perf_rf)=c("accuracy","precision","recall","AUC")
perf_rf

Overfitting is also evident in the random forest model. The randomForest() function also comes in handy when it comes to computing variable importance.

varImpPlot(model_rf,type=1)
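If the numeric scores behind the plot are needed, randomForest's importance() function returns them directly; type = 1 corresponds to the permutation-based mean decrease in accuracy shown by varImpPlot() above.

# numeric variable importance (mean decrease in accuracy); needs importance = TRUE at training time
importance(model_rf,type=1)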

Gradient boosting

Similar to random forest, gradient boosting also involves constructing multiple decision trees, but instead of training each tree on a bootstrap sample, the tree grown at each iteration is fitted to the residuals of the previous trees.
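To make the "grow trees on residuals" idea concrete, here is a minimal, self-contained sketch on a made-up regression toy problem (the data frame toy, the shrinkage v and the stump depth are illustrative choices of mine, not the exact xgboost algorithm).

# toy illustration of boosting: each shallow tree is fitted to the current residuals
set.seed(1)
toy=data.frame(x=runif(200,0,10))
toy$y=sin(toy$x)+rnorm(200,sd=0.3)
v=0.1                              # shrinkage
M=100                              # number of boosting iterations
pred=rep(mean(toy$y),nrow(toy))    # start from a constant prediction
for (m in 1:M) {
  toy$r=toy$y-pred                 # current residuals
  stump=rpart(r~x,data=toy,control=rpart.control(maxdepth=1,cp=0)) # shallow tree on residuals
  pred=pred+v*predict(stump,toy)   # shrunken additive update
}
mean((toy$y-pred)^2)               # training error keeps decreasing as M grows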

There are three important user-defined parameters in the implementation of gradient boosted trees: 1) the number of boosting iterations, M, 2) the shrinkage, v, and 3) the number of terminal nodes, J.

Experience so far indicates that 4 ≤J≤ 8 works well in the context of boosting, with the results fairly insensitive to particular choices in this range. (pg. 363)

In fact the best strategy appears to be to set v to be very small (v <0.1) and then choose M by early stopping. (pg. 365)

— “The Elements of Statistical Learning” 2nd edition by Hastie et al.

Thus, v=0.005 is chosen and M is selected via cross-validation. The functions in the R 'xgboost' package do not have an input argument for the number of terminal nodes, J, which characterizes the complexity of the trees. However, they have a 'max_depth' argument, which serves the same purpose: a tree of depth d has at most 2ᵈ terminal nodes, so max_depth = 3 roughly corresponds to J ≤ 8.

input=as.matrix(train[,vars])
cv=xgb.cv(input,label=train$Outcome,params = list(objective="binary:logistic",eta=0.005,max_depth=3),
          nfold = 5,nrounds = 1000,print_every_n = 50,metrics = "logloss")
evalframe=as.data.frame(cv$evaluation_log)
(ntree=which.min(evalframe$test_logloss_mean))
ggplot(evalframe,aes(x=iter,y=test_logloss_mean)) +
geom_line()
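As an alternative to scanning the evaluation log with which.min(), xgb.cv can stop on its own once the test logloss stops improving via its early_stopping_rounds argument; a hedged sketch (the choice of 50 rounds is arbitrary):

# let cross-validation stop once test logloss has not improved for 50 rounds
cv_es=xgb.cv(input,label=train$Outcome,params = list(objective="binary:logistic",eta=0.005,max_depth=3),
             nfold = 5,nrounds = 1000,early_stopping_rounds = 50,metrics = "logloss",verbose = FALSE)
cv_es$best_iteration # comparable to the which.min() choice above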

After selecting the best M via cross validation, train the gradient boosted tree and evaluate its performance.

model=xgboost(data=input,label=train$Outcome,params = list(objective="binary:logistic",eta=0.005,max_depth=3),
              nrounds = ntree,verbose = FALSE)
pred=predict(model,input)
train_perf=performance(train$Outcome,pred)
test_input=as.matrix(test[,vars])
pred=predict(model,test_input)
test_perf=performance(test$Outcome,pred)
perf_gb=rbind(train_perf,test_perf)
colnames(perf_gb)=c("accuracy","precision","recall","AUC")
perf_gb

Feature importance can also be visualized using the 'xgboost' package.

# feature importance
importance_matrix=xgb.importance(vars,model=model)
xgb.plot.importance(importance_matrix,rel_to_first = TRUE)
xgb.plot.shap(test_input, model = model, features = c("Glucose","BMI","Age"))

The effect of changes in Glucose, BMI and Age on the predicted probability of diabetes is shown in the plots above. The shapes of the curves are similar to the coefficient plots of the GAM model in the previous post.

Conclusion

The performances of the various classifiers are tabulated above. This article is not intended to pinpoint the superiority of one model over the others. Strictly speaking, we can only compare classifiers statistically if we establish confidence intervals for the performance metrics, which can be achieved by resampling methods such as cross-validation.
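As a rough sketch of that idea (my own illustration, reusing the decision tree and the calcAUC() helper from above, with rsample's vfold_cv() building the folds), one could look at the spread of AUC across cross-validation folds:

# AUC of the decision tree across 5 cross-validation folds
set.seed(123)
folds=vfold_cv(train,v=5,strata = "Outcome")
aucs=sapply(folds$splits,function(s){
  tr=analysis(s); te=assessment(s)             # training and held-out parts of the fold
  fit=rpart(data_formula,tr,method = "class")
  calcAUC(predict(fit,newdata=te)[,2],te$Outcome)
})
c(mean=mean(aucs),sd=sd(aucs))                 # spread gives a rough confidence band for the AUC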

The more important takeaway from this exercise is that we now know the key explanatory variables (Glucose, BMI, and Age) to look at when it comes to predicting the onset of diabetes. Referring to my previous posts, both the logistic regression and the GAM also indicate that these variables are statistically significant based on their p-values. If a person's plasma glucose concentration (obtained from an oral glucose tolerance test) and BMI are both high and he or she is around the age of 45, then that person is at high risk of being diabetic. The code is available on GitHub and RPubs.
