Application of generalized additive models (GAMs) in detecting diabetes

Jacky Lim
Jun 10, 2021

Logistic regression has proven to be a powerful modelling tool for understanding the relationship between individual features Xⱼ and a response variable y. Nonetheless, it assumes that the relationship between the independent variables and the logit is linear and monotone. Generalized additive models (GAMs) are a way to model non-monotone responses within the framework of a logistic model while maintaining additivity. Because the model is additive, the effect of each independent variable can be examined individually while holding all the other variables fixed, just as in logistic regression.

Below is the mathematical formulation of a logistic regression GAM for a binary classification problem:

log( p(X) / (1 − p(X)) ) = β₀ + f₁(X₁) + f₂(X₂) + … + fₚ(Xₚ), where p(X) = P(y = 1 | X)

Here fⱼ is a non-linear function (a smoothing spline) of predictor j. It is these non-linear functions that allow the modelling of non-linear relationships between the respective features and the response.

Let’s work through a simple example to see how it works. At the end of this demo, we will also see how a GAM compares with logistic regression in terms of performance.

First, load the required library packages in R.

library(mgcv) # library for GAM
library(ggplot2) # for beautiful plots
library(cdata) # data wrangling
library(sigr) # AUC calculation

Set up the working directory where the data file is located, then split the data into training (80%) and test (20%) sets, as in the previous post.

data=read.csv('diabetes.csv') # load the data into the global environment
set.seed(123) # for reproducibility
randn=runif(nrow(data))
train_idx=randn<=0.8
train=data[train_idx,]
test=data[!train_idx,]

The gam() function in the ‘mgcv’ package is used to train the GAM. Since the relationship between each predictor and the logit is not known beforehand, we wrap every attribute in s(), which denotes a smooth (spline basis) function.

# formula: "Outcome==1" binarizes the response variable
form_gam=as.formula("Outcome==1~s(Pregnancies)+s(Glucose)+s(BloodPressure)+
  s(SkinThickness)+s(Insulin)+s(BMI)+s(DiabetesPedigreeFunction)+
  s(Age)")
gam_model=gam(form_gam,data=train,family=binomial(link="logit"))
gam_model$converged # check if the algorithm converged
summary(gam_model)

The smoothness of the function fⱼ for variable Xⱼ is summarized by its effective degrees of freedom (edf) in the output above. An edf close to one indicates that the variable has an approximately linear relationship with the output. Predictors like pregnancies, blood pressure, skin thickness and insulin have edf of exactly one or close to one. As in logistic regression, the p-values can be used to evaluate the statistical significance of the predictors. The “deviance explained” tells us the proportion of the null deviance accounted for by the model; in this case, 36.6%.
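If you prefer to pull the edf values out programmatically instead of reading them off the printed summary, they live in the s.table component returned by mgcv’s summary.gam. A minimal sketch:

# table of smooth terms: edf, reference df, test statistic, p-value
smooth_tab=summary(gam_model)$s.table
round(smooth_tab[,"edf"],2) # values near 1 suggest a roughly linear effect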

We can always use the plot(model) syntax to visualize the fitted smooth function for each predictor. However, by default this plots one smooth function, on the scale of the linear predictor, per figure. The code snippet further below gives another way of visualizing the basis function mapping for each predictor, using ggplot2.
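For reference, the built-in call looks like this; pages is an argument of mgcv’s plot.gam that arranges all the smooths on a single page:

plot(gam_model,pages=1) # all smooth panels in one figure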

# Visualize s() output
terms=predict(gam_model,type="terms")
terms=cbind(Outcome=train$Outcome,terms)
# convert terms into a data frame
frame1=as.data.frame(terms)
# remove brackets
colnames(frame1)=gsub('[()]','',colnames(frame1))
vars=setdiff(colnames(train),"Outcome")
# remove the first character 's'
colnames(frame1)[-1]=sub(".","",colnames(frame1)[-1])
# convert from wide to long form (cdata was loaded earlier)
frame1_long=unpivot_to_blocks(frame1,nameForNewKeyColumn="basis_function",
  nameForNewValueColumn="basis_value",
  columnsToTakeFrom=vars)
# get the predictor values from the training data
data_long=unpivot_to_blocks(train,nameForNewKeyColumn="predictor",
  nameForNewValueColumn="value",
  columnsToTakeFrom=vars)
frame1_long=cbind(frame1_long,value=data_long$value)
ggplot(data=frame1_long,aes(x=value,y=basis_value)) +
  geom_smooth() +
  facet_wrap(~basis_function,ncol=3,scales="free")

Interpreting the relationship between each individual predictor and the response (the logit) is pretty straightforward from the plot above. For example, referring to the top left-hand panel (Age), the log odds of being diagnosed with diabetes increase from age 20 to around 45, plateau, and finally decline after about age 50. This interpretation holds provided all the other predictors remain unchanged. Bear in mind that the vertical scales of the panels, as well as the p-values from the model summary, should also be taken into consideration when interpreting a GAM.

# function to calculate model performance in terms of accuracy,
# precision, recall and area under the curve (AUC). The curve refers
# to the receiver operating characteristic (ROC) curve.
performance=function(y,pred){
  confmat_test=table(truth=y,predict=pred>0.5)
  acc=sum(diag(confmat_test))/sum(confmat_test)
  precision=confmat_test[2,2]/sum(confmat_test[,2])
  recall=confmat_test[2,2]/sum(confmat_test[2,])
  auc=calcAUC(pred,y)
  c(acc,precision,recall,auc)
}

It is always good practice to write a function for a computational task that has to be done repeatedly. The ‘pred’ column added to both the ‘train’ and ‘test’ data frames below contains the posterior probability P(Y=1|X), where Y=1 denotes a positive response (diabetes). Thus, an arbitrary threshold can be selected to manage the trade-off if the costs of misclassification differ between the two classes (see the sketch after the evaluation code below).

# Posterior probability
train$pred=predict(gam_model,newdata = train,type = "response")
test$pred=predict(gam_model,newdata=test,type="response")
# model performance evaluated using training data
perf_train=performance(train$Outcome,train$pred)
perf_test=performance(test$Outcome,test$pred)
perf_mat=rbind(perf_train,perf_test)
colnames(perf_mat)=c("accuracy","precision","recall","AUC")
round(perf_mat,4)
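To illustrate the threshold trade-off mentioned earlier, here is a minimal sketch; the cutoff of 0.3 is an arbitrary value chosen for illustration, not one tuned for this data:

# lowering the cutoff from 0.5 to 0.3 favours recall (fewer missed
# diabetes cases) at the expense of precision
confmat_03=table(truth=test$Outcome,predict=test$pred>0.3)
precision_03=confmat_03[2,2]/sum(confmat_03[,2])
recall_03=confmat_03[2,2]/sum(confmat_03[2,])
round(c(precision=precision_03,recall=recall_03),4)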

The performance metrics for both the logistic regression (LR) model (shown in the previous post) and the GAM are tabulated below. The GAM performs slightly better in terms of accuracy, precision and AUC, yet worse than the LR model in terms of recall.

Performance of the LR model and the GAM.
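For readers who want to reproduce the LR row of the table, the sketch below assumes the previous post fitted a plain logistic regression on the same eight predictors; that model is not shown here, so treat this as an assumption:

# hypothetical reconstruction of the logistic regression baseline
form_lr=as.formula("Outcome==1~Pregnancies+Glucose+BloodPressure+
  SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age")
lr_model=glm(form_lr,data=train,family=binomial(link="logit"))
pred_lr=predict(lr_model,newdata=test,type="response")
round(performance(test$Outcome,pred_lr),4)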

All in all, generalized additive models (GAMs) are a useful extension of linear models, enabling non-linear modelling (more flexibility) while retaining model interpretability. The code is available on GitHub and RPubs. An interactive app was also developed using the RStudio ‘shiny’ package; check it out on GitHub and at this link.
