Friday, September 22

Data Scientist Skill - Performing Linear Regression With Stata

Hi guys! Welcome to the Data Scientist Skills blog. Today I am going to share a basic statistics skill that data scientists use frequently: linear regression. I use Stata to run the regression and then interpret the results. The analysis uses one of Stata's built-in datasets.

Data Scientist Skill: Linear Regression With Stata
Technique: Linear Regression 
Tool: Stata

What is Stata?

Stata is an easy-to-use statistical software package for performing analytical operations. The tool was developed by StataCorp in 1985 and has been revised fifteen times since then. A few of the standard operations Stata is used for are basic tabulations and summaries, time-series smoothers, contrasts and comparisons, power analysis, linear regression, generalized linear models (GLM), cluster analysis, case–control analysis, ARIMA, ANOVA, and MANOVA. Here, I have used Stata to perform linear regression on a built-in dataset.

Linear Regression

Linear regression is a mathematical approach used to understand the linear relationship between an outcome variable, say Y, and one or more variables that affect it, say X1, X2, X3, ..., Xn. Here, Y is called the dependent variable and X1, X2, X3, ..., Xn are called independent variables. When there is only a single independent variable X, the relationship between Y and X can be drawn as a straight regression line. In this post, the regression is performed using only one independent variable. Hence, the equation will be

Y = b1 + (b2)X
Here, the constants b1 (the intercept) and b2 (the slope) are called the coefficients of the linear regression model. They are estimated from the variable values present in the dataset.
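To make the idea of deriving b1 and b2 from the data concrete, here is a minimal Python sketch of the ordinary least squares formulas for a single X. It is only an illustration of the arithmetic, not what Stata runs internally. The function name fit_simple_ols and the toy numbers are my own assumptions and do not come from the nlsw88 dataset.

```python
def fit_simple_ols(xs, ys):
    """Return (b1, b2) for the least-squares line Y = b1 + b2 * X."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and sum of cross-deviations of X and Y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("X must take at least two distinct values")
    b2 = sxy / sxx               # slope
    b1 = mean_y - b2 * mean_x    # intercept
    return b1, b2


# Made-up illustrative points (NOT from nlsw88) lying exactly on Y = 2 + 3X
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 5.0, 8.0, 11.0, 14.0]

b1, b2 = fit_simple_ols(toy_x, toy_y)
print(f"b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Because the toy points lie exactly on a line, the sketch recovers that line's intercept and slope. With real data the points scatter around the line, and b1 and b2 describe the best-fitting line in the least-squares sense.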

Linear Regression in Stata 

Let's begin with the regression model. Here I have used the nlsw88 dataset, which contains data from the National Longitudinal Survey of Women. It describes different employment characteristics of working women in the 1980s. Below are the steps you can follow to load the dataset into Stata.

Step 1 - Open Stata. Go to the "File" menu and click on "Example datasets".

Example datasets

Step 2 - Select the "Example datasets installed with Stata" option and load the nlsw88.dta file by clicking on "use" next to it.

Example datasets installed with Stata

nlsw88 dataset

Step 3 - Browse the data and understand the variables. Select "Data" from the menu, point to "Data Editor" and select "Data Editor (Browse)".

Data Editor (Browse)

The dataset has many variables, describing age, race, marital status, education, city, industry, occupation, wages, tenure, hours of work and total work experience.

I am interested in knowing how total work experience affects the wages of the women in the dataset. To get a general idea, I will first draw a scatter plot of wage against total work experience (ttl_exp).

Step 4 - To plot the scatter diagram, select "Graphics" from the menu and click on the first option, "Twoway graph".

Twoway graph

Step 5 - Click on "Create"; a dialog box with multiple options will appear. Choose wage as the Y variable and ttl_exp as the X variable, and leave the other settings unchanged. Click on "Submit".

Create Scatter Diagram in Stata

A scatter diagram showing wage on the Y-axis and total work experience on the X-axis will appear.

Wage vs total work experience

We can observe that wage tends to increase as total work experience increases. This suggests that there is a positive association between total work experience and wage. Let's run the linear regression model to find out more about their relationship.
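If you want a number to go with the visual impression from a scatter plot, one common choice is the Pearson correlation coefficient. The Python sketch below shows how it is computed. The function name pearson_r and the five data points are made up for illustration and are not taken from nlsw88, so the value it prints says nothing about the actual wage data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative points (NOT from nlsw88)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(toy_x, toy_y)
print(f"r = {r:.3f}")
```

A value close to +1 means the points rise together along a line, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear association.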

Step 6 - Choose "Statistics" from the main menu, then select "Linear models and related" and click on "Linear Regression".

Linear Regression in Stata

Step 7 - A dialog box will appear. Choose "wage" as the dependent variable and "ttl_exp" as the independent variable, then click on "Submit".

Linear Regression results in Stata

You can now see the result of Linear Regression in Stata. The result is shared below.

Interpreting Linear Regression Results in Stata

Interpreting the Linear Regression Results in Stata

In the above image, we can see the coefficients b1 and b2. The coefficient of _cons is our b1 and the coefficient of ttl_exp is b2. b2 is 0.33, which tells us that one additional year of total work experience is associated with a $0.33 increase in hourly wage. Do remember that this data is from the 1980s and $0.33 was a considerable amount at that time. And b1 tells us that the predicted hourly wage of a woman with no work experience is $3.60. Hence, our equation now becomes,

Y = 3.6 + 0.33 X

where Y is the hourly wage and X is the total work experience in years.
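The fitted equation can be used to work out a predicted wage for any value of X, the ttl_exp regressor chosen in Step 7. The Python sketch below plugs a few values into it. The coefficients 3.6 and 0.33 are the rounded values from this post, while the function name and the example values of X are my own assumptions for illustration.

```python
# Rounded coefficients reported in this post's regression output
B1_INTERCEPT = 3.6   # coefficient of _cons
B2_SLOPE = 0.33      # coefficient of ttl_exp


def predicted_hourly_wage(ttl_exp_years):
    """Predicted hourly wage in dollars from Y = 3.6 + 0.33 X."""
    if ttl_exp_years < 0:
        raise ValueError("ttl_exp cannot be negative")
    return B1_INTERCEPT + B2_SLOPE * ttl_exp_years


# Hypothetical values of X, chosen only to illustrate the arithmetic
for years in (0, 5, 10, 20):
    wage = predicted_hourly_wage(years)
    print(f"ttl_exp = {years:>2} -> predicted wage = ${wage:.2f} per hour")
```

Keep in mind that these are predictions from a straight line fitted to the sample. They are most trustworthy for values of X inside the range seen in the data, and individual wages will scatter around them.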

We are now done with linear regression for a single independent variable. You can add more independent variables to the model to get a better idea of the relationship between wage and the other variables. Linear regression is a simple and effective way to understand the relationship between variables, which makes it an important data scientist skill.
