Linear regression and regularized regression: step by step example

Jacky Lim
Jul 4, 2021 · 7 min read

Linear regression (LR) is a staple method in statistical modeling, in which a numerical output is predicted by a linear combination of predictors weighted by coefficients, βᵢ. The linear regression model has long been a go-to method for machine learning practitioners because of its simplicity and because it performs well in most problems, as long as the linearity assumption is fulfilled. Regularization constrains or shrinks the coefficient estimates towards zero; shrinking the coefficient estimates can reduce the variance of the model.
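For reference, the underlying model and the two shrinkage penalties discussed later can be written as follows (standard notation, added here only as a refresher; ŷ denotes the predicted quality and x₁, …, xₚ the predictors):

% linear regression: response predicted by a linear combination of p predictors
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
% ridge adds an L2 penalty and lasso an L1 penalty to the least-squares objective
\text{ridge:}\; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\qquad
\text{lasso:}\; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|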

At the end of this post, you will learn how to:

1) Develop a linear regression model and regularized regression models (ridge, lasso and elastic net) in R.

2) Perform repeated k-fold cross-validation (CV) to evaluate the regression models’ performance.

Data acquisition and visualization

First, load the necessary packages and import the data (the white wine dataset, freely available from the UCI Machine Learning Repository and Kaggle) into the RStudio global environment. Next, use str() and summary() to display the attribute types and a statistical summary of every variable. Note that wine quality (the response variable) is an integer, where a high rating indicates high quality and vice versa.

library(ggplot2)
library(rsample) # stratified sampling
library(reshape2) # data wrangling
library(knitr)
library(car) # variance inflation factor
library(glmnet) # regularized linear regression
library(glmnetUtils)
library(gridExtra)
library(dplyr) # data wrangling
data=read.csv('winequality-white.csv', sep = ";")
str(data)
# statistical summary of each attribute
summary(data)

Perform stratified partitioning to split the data into training and test datasets.

set.seed(2)
split=initial_split(data, prop=0.8, strata = "quality")
train=training(split)
test=testing(split)
ggplot(data=train, aes(x=quality)) +
geom_bar(fill="steelblue") + theme_classic() +
ggtitle('Distribution of response variable in training dataset')

Most of the white wines have quality ratings of 5 to 7, with a small number of samples at the low and high ends.

It is always a good idea to compute the predictors’ correlation matrix and visualize it in a heatmap to gauge the collinearity (degree of correlation) between predictors.

vars=setdiff(colnames(train), 'quality')
corr=cor(train[,vars])
library(reshape2)
corr[upper.tri(corr)]=NA
melted_cormat=melt(corr, na.rm = TRUE)
# heatmap
ggplot(data=melted_cormat, aes(Var1, Var2, fill=value)) +
geom_tile(color='white') +
scale_fill_gradient2(low = "blue", high="red", mid="white", midpoint = 0,
limit=c(-1,1), space = "Lab", name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed() +
geom_text(aes(Var1, Var2, label=round(value,2)), color="black", size=3) +
xlab("") + ylab("")
Correlation matrix of predictors in training dataset.

There is a strong positive linear correlation between the ‘density’ and ‘residual sugar’ variables, while ‘density’ and ‘alcohol’ are negatively correlated.

Linear regression

The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response.

Gareth James et al., An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013 (p. 99)

Collinearity reduces the accuracy of the estimates of the regression coefficients. Aside from the correlation matrix, a more robust way to detect it is to calculate the variance inflation factor (VIF). As a rule of thumb, a VIF of more than 10 indicates high collinearity.
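For completeness, the VIF of the j-th predictor is defined in terms of the R² obtained by regressing that predictor on all the others (standard definition):

% VIF of predictor j; R_j^2 is the R^2 from regressing x_j on the remaining predictors
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}

A VIF of 1 means the predictor is uncorrelated with the rest; the value blows up as Rⱼ² approaches 1.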

Next, I fit the linear regression model using all the predictors and compute the VIF for each predictor.

# Train linear regression model
model_lm=lm(quality~.,data = train)
summary(model_lm)
kable(data.frame(vif=vif(model_lm)))
Variance inflation factor (VIF) for each predictor.

The VIFs of the variables ‘density’ and ‘residual sugar’ are both greater than 10. One way to mitigate collinearity is therefore to remove one of the correlated predictors, as shown below.

drops='density'
model_lm=lm(quality~.,data=train[,!(colnames(train) %in% drops)])
summary(model_lm)
kable(data.frame(vif=vif(model_lm)))
The VIFs of all predictors are now less than 5, indicating weak multicollinearity.

Dropping the variable ‘density’ results in changes to the coefficient estimates, the coefficient of determination (R-squared) and the residual standard error of the linear regression (LR) model. From the model summary, we know that ‘volatile acidity’ is negatively associated with white wine quality, whereas ‘sulphates’ and ‘alcohol’ are positively associated with it.

Regularized linear regression

Regularized linear regression can be implemented in R using the glmnet package. For a Gaussian response, glmnet solves the following problem:

\min_{\beta_0,\beta} \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \beta_0 - x_i^T \beta\right)^2 + \lambda \left[ \frac{1-\alpha}{2} \lVert\beta\rVert_2^2 + \alpha \lVert\beta\rVert_1 \right]

over a grid of values of λ covering the entire range of possible solutions. The mixing parameter α determines the balance between the ridge (L2) and lasso (L1) penalties.

Ridge regression

A ridge regression model can be trained by setting the alpha argument of the cv.glmnet() function to 0. Standard linear regression coefficient estimates are scale equivariant, but ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant. Thus, prior to training the ridge regression model, we should standardize the predictors.

# First standardize the predictors
X_stand=scale(train[,vars], center = FALSE, scale = TRUE)
train_stand=data.frame(X_stand, quality=train$quality)
# scale the test set using the scaling factors learned from the training set
X1_stand=scale(test[,vars], center = FALSE, scale = attributes(X_stand)$'scaled:scale')
test_stand=data.frame(X1_stand, quality=test$quality)
model_ridge=cv.glmnet(quality~., data=train_stand, alpha = 0)
# visualize coefficients
coef_ridge=coef(model_ridge)
df_coef_ridge=data.frame(coef_name=rownames(coef_ridge)[-1], coef=coef_ridge[-1,1])
ggplot(data = df_coef_ridge, aes(x=coef_name, y=coef)) +
geom_bar(stat = 'identity', width=0.5) +
coord_flip()
# visualize cross-validation error with respect to different lambda
roundoff=function(x) sprintf("%.3f", x)
df_cvm_ridge=data.frame(lambda=model_ridge$lambda, cvm=model_ridge$cvm,
cvm_up=model_ridge$cvup, cvm_down=model_ridge$cvlo)
ggplot(data=df_cvm_ridge, aes(x=lambda, y=cvm, ymin=cvm_down, ymax=cvm_up)) +
geom_line() + geom_point() + geom_errorbar() + scale_x_continuous(trans = "log2", labels = roundoff) +
geom_vline(xintercept = model_ridge$lambda.min, linetype=2, color="red") +
geom_vline(xintercept = model_ridge$lambda.1se, linetype=3, color="blue") +
ggtitle('Ridge regression')
Coefficient estimates corresponding to its respective predictors (ridge regression model).
Cross-validation error with respect to changes of λ, regularization parameter.

The function cv.glmnet() performs cross-validation to select the lambda that gives the minimum cross-validation error. When functions like predict() are called on the cv.glmnet object, the model at lambda.1se is used by default; this is the largest value of lambda such that the cross-validation error is within one standard error of the minimum, lambda.min.
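To make that choice explicit, the s argument of predict() selects which lambda is used. A quick sketch, assuming the glmnetUtils formula interface used above (whose predict() method accepts a newdata data frame):

# predictions from the ridge model at the default lambda.1se versus lambda.min
pred_1se = predict(model_ridge, newdata = test_stand, s = "lambda.1se")
pred_min = predict(model_ridge, newdata = test_stand, s = "lambda.min")
# side-by-side comparison of the first few test-set predictions
head(cbind(lambda.1se = as.numeric(pred_1se), lambda.min = as.numeric(pred_min)))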

Lasso

A lasso regression model can be trained by setting the alpha argument of the cv.glmnet() function to 1, as shown below.

model_lasso=cv.glmnet(quality~., data=train_stand, alpha = 1)
# visualize coefficients
coef_lasso=coef(model_lasso)
df_coef_lasso=data.frame(coef_name=rownames(coef_lasso)[-1], coef=coef_lasso[-1,1])
ggplot(data = df_coef_lasso, aes(x=coef_name, y=coef)) +
geom_bar(stat = 'identity', width=0.5) +
coord_flip()
# visualize cross-validation error with respect to different lambda,
# plus the number of nonzero coefficients
df_cvm_lasso=data.frame(lambda=model_lasso$lambda, cvm=model_lasso$cvm,
cvm_up=model_lasso$cvup, cvm_down=model_lasso$cvlo)
p1=ggplot(data=df_cvm_lasso, aes(x=lambda, y=cvm, ymin=cvm_down, ymax=cvm_up)) +
geom_line() + geom_point() + geom_errorbar() + scale_x_continuous(trans = "log2", labels = roundoff) +
geom_vline(xintercept = model_lasso$lambda.min, linetype=2, color="red") +
geom_vline(xintercept = model_lasso$lambda.1se, linetype=3, color="blue") +
ggtitle('Lasso')
df_nzero_lasso=data.frame(lambda=model_lasso$lambda, nzero=model_lasso$nzero)
p2=ggplot(data=df_nzero_lasso, aes(x=lambda, y=nzero)) +
geom_line(color="steelblue") + scale_x_continuous(trans = "log2", labels = roundoff) +
ylab('Number of non-zero coefficients')
grid.arrange(p1, p2, ncol=2)
Coefficient estimates corresponding to its respective predictors (lasso regression model).
Left: Cross-validation error with respect to changes of λ. Right: Number of non-zero coefficients with respect to changes of λ.

Elastic net

Elastic net is a compromise between ridge and lasso, with alpha somewhere between 0 and 1. I use the cva.glmnet() function from the glmnetUtils package to determine the best alpha, then call cv.glmnet() with alpha set to that best value.

elastic_net=cva.glmnet(quality~., data=train_stand)
# function to get the cvm at lambda.1se for each alpha
get_cvm=function(model) {
index <- match(model$lambda.1se, model$lambda)
model$cvm[index]
}
# data frame that contains alpha and its corresponding cross-validation error
enet_performance=data.frame(alpha=elastic_net$alpha)
models=elastic_net$modlist
enet_performance$cvm=vapply(models, get_cvm, numeric(1))
# get the best alpha and train the model
best_alpha=enet_performance[which.min(enet_performance$cvm), 'alpha']
model_enet=cv.glmnet(quality~., data=train_stand, alpha = best_alpha)

As shown in the grouped bar plot above, the coefficient estimates differ across the regularization methods. It is important to note that regularization affects model interpretability, since we no longer have standard errors and p-values for the coefficient estimates to evaluate their statistical significance. Lasso and elastic net regression can also be regarded as a form of variable selection: a larger lambda tends to drive more coefficient estimates to exactly zero.
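To see this variable-selection effect on the fitted lasso model, count the non-zero coefficients at the two lambda values stored by cv.glmnet(); a short sketch using the model_lasso object above:

# lasso coefficients at the default lambda.1se and at lambda.min
coef_1se = coef(model_lasso, s = "lambda.1se")
coef_min = coef(model_lasso, s = "lambda.min")
# number of predictors retained (row 1 is the intercept, so exclude it)
sum(coef_1se[-1, 1] != 0)
sum(coef_min[-1, 1] != 0)
# predictors driven exactly to zero at the larger lambda.1se
rownames(coef_1se)[-1][coef_1se[-1, 1] == 0]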

The code and the corresponding results for model evaluation using the test set can be found on RPubs.
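That evaluation boils down to comparing predictions against the held-out test responses. Below is a minimal sketch (not the exact RPubs code) that computes RMSE, MAE and R² on the test set for every model fitted above; note that the glmnet models expect the standardized test set:

# simple performance metrics for regression
rmse = function(y, yhat) sqrt(mean((y - yhat)^2))
mae  = function(y, yhat) mean(abs(y - yhat))
rsq  = function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
# predictions on the test set (linear regression uses the raw test data,
# the glmnet models use the standardized test data)
pred_lm    = predict(model_lm, newdata = test)
pred_ridge = as.numeric(predict(model_ridge, newdata = test_stand))
pred_lasso = as.numeric(predict(model_lasso, newdata = test_stand))
pred_enet  = as.numeric(predict(model_enet, newdata = test_stand))
preds = list(lm = pred_lm, ridge = pred_ridge, lasso = pred_lasso, enet = pred_enet)
data.frame(RMSE = sapply(preds, rmse, y = test$quality),
           MAE  = sapply(preds, mae,  y = test$quality),
           R2   = sapply(preds, rsq,  y = test$quality))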

k-fold cross validation

Essentially, k-fold cross-validation (CV) is a resampling method that involves partitioning the dataset into k distinct groups (folds). In each iteration, (k-1) folds are used to train the model, while the remaining fold is used to measure the model’s performance. This step is repeated k times, so we end up with k performance estimates, which allows us to calculate a standard deviation and construct a confidence interval. Repeated k-fold CV simply means the k-fold CV procedure is repeated l times, where l is an arbitrary parameter. The code can be found on GitHub and RPubs.
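As an illustration of the procedure (a minimal sketch, not the exact GitHub code), the loop below runs repeated k-fold CV for the linear regression model using only base R; k and the number of repeats are arbitrary choices here:

set.seed(2)
k = 5        # number of folds
repeats = 3  # number of times the k-fold procedure is repeated
cv_rmse = c()
for (r in 1:repeats) {
  # randomly assign each row of the training data to one of k folds
  folds = sample(rep(1:k, length.out = nrow(train)))
  for (i in 1:k) {
    fit = lm(quality ~ . - density, data = train[folds != i, ])
    pred = predict(fit, newdata = train[folds == i, ])
    cv_rmse = c(cv_rmse, sqrt(mean((train$quality[folds == i] - pred)^2)))
  }
}
mean(cv_rmse); sd(cv_rmse)  # average CV error and its spread across the k*repeats folds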

Summary

Root mean square error (RMSE), mean absolute error (MAE) and the R-squared statistic are some commonly used performance metrics for regression. The lower the RMSE and MAE, and the closer R-squared is to 1, the better the performance. However, R² can be misleading if the error variance is large. The performance of each model, evaluated using stratified sampling and repeated k-fold CV, is tabulated below:

Performances of regression models.

In this case, linear regression performs slightly better than the regularized regression models. The low R² implies that more than 70% of the variation in the data is left unexplained by the linear models. Non-linear methods such as neural networks and support vector machines are good candidates to improve prediction accuracy.

Reference

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553. ISSN: 0167–9236.
