Saturday, October 14

Data Scientist Skills - Which Regression Model to Use?

Regression is one of the simplest and most widely used machine learning techniques for understanding the relationship between the predictors and a dependent variable. A linear regression explains how the predicted value (say Y) depends on each individual attribute (say X1, X2, ..., Xn).
Linear Regression Output in Stata

Regression Graph in Stata

The images above show a linear regression performed in Stata to understand how house prices vary with different features of the house. The second image shows the regression line for house price with respect to the living area of the house.

In this post, I will perform multiple linear regression to observe how the price of a house varies with different attributes: living area in sq feet (livarea), number of bedrooms (beds), age of the house (age), number of bathrooms (baths), and whether the house has a pool (pool). The variable pool is categorical: it is 1 if the house has a pool and 0 if it doesn't. Refer to this post to learn how to perform linear regression with Stata.

The dataset is publicly available and covers 1500 houses sold in Stockton, CA between 1996 and 1998. You can download the Excel file from here - Stockton4.xlsx.

Regression Models

I will run three models by transforming these variables, and then we can compare them to find which one is the best explanatory model. Let's see the results.

Model 1 - Includes square terms.
Model 2 - Includes interaction terms.
Model 3 - Includes both square and interaction terms.

Model 1

Here are the results of multiple linear regression with square terms. Note that livarea*livarea is written as c.livarea#c.livarea in Stata. In addition to the base variables, Model 1 includes the squares of living area, age of the house, and number of bedrooms.

Model 2

Below is the result from Model 2. In this model, we have interaction terms of livarea*age, livarea*beds, and age*beds along with base variables.

Model 3

Model 3 has all the variables, including the square terms and interaction variables.
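
To make the three specifications concrete, here is a minimal Python sketch (not the Stata commands used in this post) that lists the regressors each model would contain for a single house. The helper name model_terms, the term labels, and the example house values are made up for illustration; only the variable names come from the post.

```python
# Hypothetical helper: builds the regressors for one house under each model.
BASE = ("livarea", "beds", "age", "baths", "pool")
SQUARED = ("livarea", "age", "beds")  # variables squared in Models 1 and 3
PAIRS = (("livarea", "age"), ("livarea", "beds"), ("age", "beds"))  # Models 2 and 3


def model_terms(row, squares=False, interactions=False):
    """Return a dict of regressor name -> value for one observation."""
    terms = {name: row[name] for name in BASE}
    if squares:
        for name in SQUARED:
            terms[f"{name}^2"] = row[name] ** 2
    if interactions:
        for a, b in PAIRS:
            terms[f"{a}*{b}"] = row[a] * row[b]
    return terms


# Made-up example house, not a row from the Stockton data.
house = {"livarea": 2000, "beds": 3, "age": 10, "baths": 2, "pool": 0}

model1 = model_terms(house, squares=True)                     # base + squares
model2 = model_terms(house, interactions=True)                # base + interactions
model3 = model_terms(house, squares=True, interactions=True)  # base + both

print(len(model1), len(model2), len(model3))
```

Counting the constant as well, Models 1 and 2 each estimate 9 coefficients and Model 3 estimates 12, which is the count that the penalty terms in the selection criteria work with.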

Which Regression Model to Use?

Now we have these three models. How do we decide which one to use? To choose a regression model, we look at three things:

  • Adjusted R squared
  • Akaike Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)

Adjusted R squared

It is similar to R squared, which measures how much of the variation in the dependent variable the model explains, but its value decreases when irrelevant variables are added. Hence, the higher the adjusted R squared value, the better the regression model. Below is the formula for adjusted R squared.
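
A standard textbook form of adjusted R squared, assumed here to be the one intended, is shown below. SST (the total sum of squares) is a symbol introduced for this sketch; SSE is the sum of squared errors, N is the number of observations, and K is taken to be the number of estimated coefficients including the constant.

```latex
\bar{R}^2 = 1 - \frac{SSE / (N - K)}{SST / (N - 1)}
```

Adding a variable always lowers SSE, but it also raises K, so the adjusted value only goes up when the drop in SSE is large enough to offset the lost degree of freedom.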

The higher the adjusted R squared, the better the model fits using only relevant variables.

Akaike Information Criterion (AIC)

It measures how well the model fits relative to the number of variables included in the model. The lower the value of AIC, the better. Below is the formula for AIC.
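
A common textbook form of AIC for a linear regression, written with the symbols SSE, N and K and assumed here to be the form intended, is:

```latex
\mathrm{AIC} = \ln\!\left(\frac{SSE}{N}\right) + \frac{2K}{N}
```

This form takes K to be the number of estimated coefficients, including the constant. Software often reports the likelihood-based version, -2 ln(L) + 2K, instead; for linear regressions fitted to the same data, the two versions rank models in the same order.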


SSE = Sum of squared errors
N     = Number of observations
K     = Number of estimated coefficients in the model (including the constant); the residual degrees of freedom are N - K

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC), also known as the Schwarz Criterion (SC), is similar to AIC but imposes a heavier penalty for each variable added to the model, so a variable that contributes little to the fit increases the BIC value. Hence, the lower the BIC value, the better the model.
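
In the same SSE-based textbook form used for AIC (again an assumed form, with K counting the estimated coefficients including the constant), BIC can be written as:

```latex
\mathrm{BIC} = \ln\!\left(\frac{SSE}{N}\right) + \frac{K \ln N}{N}
```

The only difference from AIC is the penalty term: the factor 2 is replaced by ln(N).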

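BIC's penalty per variable grows with ln(N), which exceeds AIC's fixed factor of 2 once N is 8 or more, so for any realistic sample size BIC is the stricter criterion. For that reason, I treat BIC as the most important criterion. Even if adjusted R squared and AIC say model A is better, if BIC is lower for model B, you should conclude that model B is better and choose it.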
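
The decision rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Stata workflow from this post: the formulas are the SSE-based textbook forms, K counts the estimated coefficients including the constant, and the SSE, SST and K values are invented so that the criteria disagree.

```python
import math


def adjusted_r2(sse, sst, n, k):
    # 1 - (SSE / (N - K)) / (SST / (N - 1)); higher is better
    return 1.0 - (sse / (n - k)) / (sst / (n - 1))


def aic(sse, n, k):
    # ln(SSE / N) + 2K / N; lower is better
    return math.log(sse / n) + 2.0 * k / n


def bic(sse, n, k):
    # ln(SSE / N) + K ln(N) / N; lower is better
    return math.log(sse / n) + k * math.log(n) / n


# Made-up illustrative numbers, NOT the Stockton regression results.
N = 1500
SST = 1000.0
models = {
    "A": {"sse": 398.0, "k": 12},  # larger model, slightly better fit
    "B": {"sse": 400.0, "k": 9},   # smaller model, slightly worse fit
}

scores = {
    name: {
        "adj_r2": adjusted_r2(m["sse"], SST, N, m["k"]),
        "aic": aic(m["sse"], N, m["k"]),
        "bic": bic(m["sse"], N, m["k"]),
    }
    for name, m in models.items()
}

best_by_adj_r2 = max(scores, key=lambda name: scores[name]["adj_r2"])
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])

# Rule from the text: when the criteria disagree, BIC decides.
chosen = best_by_bic
print(best_by_adj_r2, best_by_aic, best_by_bic, chosen)
```

With these invented inputs, adjusted R squared and AIC prefer model A while BIC prefers model B, so the rule picks B. The three extra coefficients in A do not reduce SSE enough to pay for BIC's larger penalty.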

Now if we go back to our results, we see that Model 3 has the highest adjusted R squared value, the lowest AIC, and the lowest BIC. Hence, without a second thought, we conclude that regression Model 3 should be used.

Note that Model 3 is not necessarily the best of all possible models. We could transform the variables further to build more models and compare their BIC values. But among the three models above, Model 3 is the best choice for the reasons discussed.