Data Scientist Skill: Linear Regression With Stata
Technique: Linear Regression
What is Stata?
Stata is an easy-to-use statistical software package for analytical work. The tool was developed by StataCorp in 1985 and has been revised fifteen times since then. Standard operations for which Stata is used include basic tabulations and summaries, time-series smoothers, contrasts and comparisons, power analysis, linear regression, generalized linear models (GLM), cluster analysis, case–control analysis, ARIMA, ANOVA, and MANOVA. Here, I use Stata to perform linear regression on one of its built-in datasets.
Linear regression is a mathematical approach used to understand the linear relationship between an outcome variable, say Y, and one or more variables that affect it, say X1, X2, X3, ..., Xn. Here, Y is called the dependent variable and X1, X2, X3, ..., Xn are called independent variables. When Y is related to a single variable X, the relationship can be drawn as a regression line. In this post, the regression uses only one independent variable. Hence, the equation will be
Y = b1 + (b2)X
Here, b1 (the intercept) and b2 (the slope) are the coefficients of the linear regression model. They are estimated from the variable values present in the dataset.
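As a sketch of how these two values are obtained, assume the ordinary least squares method, which is what Stata's regress command fits. The notation is introduced here for illustration: X_i and Y_i are the n observed pairs, and X-bar and Y-bar are their sample means. The estimates then have the following closed form:

```latex
b_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_1 = \bar{Y} - b_2 \bar{X}
```

In words, b2 is the covariation of X and Y divided by the variation of X, and b1 shifts the line so that it passes through the point of means.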
Linear Regression in Stata
Let's begin with the regression model. Here I have used the nlsw88 dataset, an extract from the National Longitudinal Survey of Women. It describes different employment characteristics of working women in the 1980s. Below are the steps you can follow to load the dataset into Stata.
Step 1 - Open Stata. Go to the File menu and click on "Example datasets".
Step 2 - Select the "Example datasets installed with Stata" option and load the nlsw88.dta file by clicking on "use".
Step 3 - Browse the data and understand the variables. Select "Data" from the menu, drag over to "Data Editor" and select "Data Editor (Browse)".
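If you prefer typing commands, Steps 1 to 3 can be approximated from the Command window. This is a minimal sketch, assuming the example datasets were installed with Stata; the summarize line is an extra not covered by the menu steps.

```stata
* Load the example dataset shipped with Stata
* (clear discards any data currently in memory)
sysuse nlsw88, clear

* List the variables with their storage types and labels
describe

* Extra step: summary statistics for the two variables used in this post
summarize wage ttl_exp

* Open the Data Editor in browse (read-only) mode
browse
```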
I am interested in knowing how total work experience affects the wages of the female employees. To get a general idea, I will first draw a scatter plot of wage against total work experience (ttl_exp).
Step 4 - To plot the scatter diagram, select the "Graphics" option from the menu and click on the first option, "Twoway graph".
Step 5 - Click on "Create"; a dialog box with multiple options will appear. Choose wage as the Y variable and ttl_exp as the X variable, and keep the other settings the same. Click on "Submit".
A scatter diagram showing wage on the Y-axis and total work experience on the X-axis will appear.
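The same plot can be produced with a command instead of the dialog. The sketch below assumes nlsw88 is already loaded. The second command is an optional addition that is not part of the menu steps: it overlays a fitted straight line on the points.

```stata
* Scatter plot with wage on the Y-axis and ttl_exp on the X-axis
twoway scatter wage ttl_exp

* Optional: the same scatter with a linear fit line drawn over it
twoway (scatter wage ttl_exp) (lfit wage ttl_exp)
```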
We can observe that the values of Y tend to increase as X increases. This suggests a positive association between total work experience and wage. Let's run the linear regression model to find out more about their relationship.
Step 6 - Choose "Statistics" from the main menu, then select "Linear models and related" and click on "Linear Regression".
Step 7 - A dialog box will appear. Choose "wage" as the dependent variable and "ttl_exp" as the independent variable, then click on "Submit".
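Steps 6 and 7 correspond to a single command. In this sketch the dependent variable comes first and the independent variable follows; it assumes nlsw88 is loaded.

```stata
* Regress wage (dependent variable) on ttl_exp (independent variable)
regress wage ttl_exp
```

The coefficient table in the output has one row for ttl_exp and one row for _cons, the constant term.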
You can now see the result of Linear Regression in Stata. The result is shared below.
Interpreting the Linear Regression Results in Stata
In the above image, we can see the coefficients b1 and b2. The coefficient of _cons is our b1 and the coefficient of ttl_exp is b2. b2 is 0.33, which tells us that one additional year of total work experience is associated with a $0.33 increase in hourly wage. Do remember that this data is from the 1980s, when $0.33 was a considerable amount. b1 tells us that the model predicts an hourly wage of $3.6 for a woman with no work experience. Hence, our equation now becomes,
Y = 3.6 + 0.33 X
where Y is the hourly wage and X is the total work experience in years.
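To see the fitted equation in use, here is a small sketch that predicts the hourly wage for a hypothetical woman with 10 years of total work experience. The value 10 is chosen only for illustration. With the rounded coefficients, the arithmetic is 3.6 + 0.33 × 10 = 6.9, or about $6.90 per hour.

```stata
* Prediction using the rounded coefficients quoted in this post
display 3.6 + 0.33*10

* The same prediction from the unrounded estimates stored by regress
sysuse nlsw88, clear
regress wage ttl_exp
display _b[_cons] + _b[ttl_exp]*10
```

The second result can differ slightly from the first because _b[] holds the coefficients at full precision rather than rounded to two digits.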
We are now done with the linear regression for a single variable. You can add more independent variables to get a better idea of the relationship between wage and other variables. Linear regression is a simple and effective way to understand the relationship between variables, which makes it an important data scientist skill.
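As a sketch of that next step, more independent variables can be listed after the dependent variable. The choice of tenure (job tenure in years) and grade (current grade completed) is only an illustration; both are variables in nlsw88.

```stata
* Multiple regression: wage on experience, job tenure and schooling
* (tenure and grade are picked here only as examples)
sysuse nlsw88, clear
regress wage ttl_exp tenure grade
```

Each coefficient is then read as the association with wage when the other listed variables are held fixed.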