Saturday, October 14

Data Scientist Skills - Which Regression Model to Use?

Regression is one of the simplest and most widely used machine learning algorithms; it helps you understand the relationship between the predictors and the dependent variable. A linear regression explains the dependency of the predicted value (say Y) on the individual attributes (say X1, X2 .... Xn).
Linear Regression Output in Stata

Regression Graph in Stata

The above images show a linear regression performed in Stata to understand how the price of a house varies with different features of the house. The second image shows the regression line for the variation of house price with the living area of the house.

In this post, I will perform multiple linear regression to observe the variation in the price of a house with changes in different attributes, which include living area in sq feet (livarea), number of bedrooms (beds), age of the house (age), number of bathrooms (baths) and whether the house has a pool (pool). The variable pool is a categorical variable: it is 1 if the house has a pool and 0 if it doesn't. Refer to this post to learn how to perform linear regression with Stata.

The dataset used is publicly available and has the data for 1500 houses sold in Stockton, CA between 1996 and 1998. You can download the Excel file from here - Stockton4.xlsx.
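As a rough sketch of what such a multiple regression computes, here is a Python version using ordinary least squares. Since the Stockton4.xlsx file isn't loaded here, the data (and the coefficient values baked into it) are synthetic stand-ins, not the real Stockton estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1500  # mirrors the 1500 Stockton sales

# Hypothetical stand-ins for the Stockton4 columns
livarea = rng.uniform(8, 40, n)        # living area (hundreds of sq ft)
beds    = rng.integers(1, 6, n)        # number of bedrooms
age     = rng.uniform(0, 80, n)        # age of the house
baths   = rng.integers(1, 4, n)        # number of bathrooms
pool    = rng.integers(0, 2, n)        # 1 if the house has a pool

# Simulated price generated from made-up coefficients plus noise
price = 20 + 9*livarea - 2*beds - 0.2*age + 5*baths + 12*pool + rng.normal(0, 10, n)

# Design matrix with a constant column, like Stata's
# `regress price livarea beds age baths pool`
X = np.column_stack([np.ones(n), livarea, beds, age, baths, pool])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coef)  # [const, livarea, beds, age, baths, pool] estimates
```

With enough observations, the least-squares estimates land close to the coefficients used to simulate the prices.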

Regression Models

I will run three models by transforming these variables and then compare them to find which one is the best explanatory model. Let's see the results.

Model 1 - Includes square terms.
Model 2 - Includes interaction terms.
Model 3 - Includes both square and interaction terms.

Model 1

Here are the results of the multiple linear regression with square terms. Note that livarea*livarea is written as c.livarea#c.livarea in Stata. Model 1 includes the squares of the variables living area, age of the house and number of bedrooms, in addition to the base variables.

Model 2

Below is the result of Model 2. In this model, we have the interaction terms livarea*age, livarea*beds, and age*beds along with the base variables.

Model 3

Model 3 has all the variables, including both the square terms and the interaction terms.
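In Python terms, the three specifications differ only in which extra columns are appended to the design matrix. A sketch with synthetic stand-in data (in Stata these extra columns are written with the c.var#c.var factor notation rather than built by hand):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1500
livarea = rng.uniform(8, 40, n)
beds = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 80, n)
baths = rng.integers(1, 4, n).astype(float)
pool = rng.integers(0, 2, n).astype(float)

base = [np.ones(n), livarea, beds, age, baths, pool]
squares = [livarea**2, age**2, beds**2]               # Model 1 extras
interactions = [livarea*age, livarea*beds, age*beds]  # Model 2 extras

X1 = np.column_stack(base + squares)                  # Model 1
X2 = np.column_stack(base + interactions)             # Model 2
X3 = np.column_stack(base + squares + interactions)   # Model 3
print(X1.shape, X2.shape, X3.shape)
```

Each matrix can then be fitted with the same least-squares call, so the models differ only in their columns.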

Which Regression Model to Use?

Now we have these three models. How do we decide which model to use? To choose a regression model we look at three things:

  • Adjusted R squared
  • Akaike Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)

Adjusted R squared

It is similar to R squared, which measures how much of the variation the model explains, but its value decreases when irrelevant variables are added. Hence, the higher the adjusted R squared value, the better the regression model. Below is the formula for adjusted R squared.
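In standard textbook form, with N observations and K estimated coefficients (including the constant), the adjusted R squared is:

```latex
\bar{R}^{2} = 1 - \frac{(1 - R^{2})(N - 1)}{N - K}
```

Adding a variable always raises R squared, but it also raises K, so adjusted R squared only improves when the new variable explains enough to offset the penalty.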

So when comparing models, the one with the higher adjusted R squared is the better fit after accounting for the number of variables it uses.

Akaike Information Criterion (AIC)

It's a measure of the fit of the model relative to the number of variables included in the model. The lower the value of AIC, the better. Below is the formula for AIC.
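In the SSE-based form used in many econometrics texts (the form Stata's regression output can be reduced to), the criterion is:

```latex
AIC = \ln\!\left(\frac{SSE}{N}\right) + \frac{2K}{N}
```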


SSE = Sum of squared errors
N = Number of observations
K = Number of coefficients estimated, i.e. the number of variables in the model including the constant

Bayesian Information Criterion

The Bayesian Information Criterion (BIC), also known as the Schwarz Criterion (SC), is similar to AIC but applies a heavier penalty, increasing the BIC value more sharply when a less significant variable is added to the model. Hence, the lower the BIC value, the better the model.
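In the same SSE-based form, BIC replaces the 2 in the AIC penalty with ln(N):

```latex
BIC = \ln\!\left(\frac{SSE}{N}\right) + \frac{K \ln(N)}{N}
```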

For N > 8, ln(N) > 2, so BIC penalizes extra variables more heavily than AIC and becomes the most important criterion to consider. Even if adjusted R squared and AIC say model A is better, if BIC is lower for model B, you should conclude that model B is better and should be chosen.

Now, if we go back to our results, we see that Model 3 has the highest adjusted R squared, the lowest AIC and the lowest BIC. Hence, without a second thought, we conclude that Model 3 should be used.
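To make the comparison concrete, here is a small Python sketch (the helper is my own, using the SSE-based formulas with K counting the constant) that fits two nested models on synthetic data and reports all three criteria; the model that includes the true interaction should come out ahead on each:

```python
import numpy as np

def ols_criteria(X, y):
    """Adjusted R squared, AIC and BIC from an OLS fit (SSE-based forms)."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    aic = np.log(sse / n) + 2 * k / n
    bic = np.log(sse / n) + k * np.log(n) / n
    return adj_r2, aic, bic

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 + 0.5 * x1 * x2 + rng.normal(size=n)  # true model has an interaction

X_small = np.column_stack([np.ones(n), x1, x2])
X_full = np.column_stack([np.ones(n), x1, x2, x1 * x2])
print("base:            ", ols_criteria(X_small, y))
print("with interaction:", ols_criteria(X_full, y))
```

Higher adjusted R squared, lower AIC and lower BIC all point the same way here; the criteria only disagree in closer calls, which is when the BIC rule of thumb matters.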

Note that Model 3 is not necessarily the best of all possible models. We can transform the variables further to build more models and check their BIC values. But among the three models above, Model 3 is the best choice for the reasons discussed.

Sunday, September 24

Data Scientist Daily Task - What Do They Do?

Do you want to become a data scientist but have never had the opportunity to play that role? Have you always been curious about what a data scientist does all day? How they start their day, what kind of projects they complete, or what techniques they use in day-to-day work? Then this post is for you.

This is a short interview with a data scientist from AT&T about his day and life as a data scientist. The video was shared by RCR Wireless News, a quite popular YouTube channel. In it, Karthik Rajagopalan, who completed his Ph.D. in Electrical Engineering at Arizona State University, talks about a data scientist's daily tasks and what you need to get there.

AT&T Data Scientist - Karthik Rajagopalan

Below are the interview questions and answers:

Q1 - What does a Data Scientist do?

There is a constant stream of getting data, looking at data, shaping data, and coming up with different approaches. What am I going to create today? What sort of questions do I have and what am I trying to answer? Which questions are relevant to get to the result? And so for every problem, we have a solution approach that we lay down before starting any analysis.

Q2 - What is your background?

I have a background in engineering; I have degrees in mechanical, industrial and electrical engineering. Basically a math-heavy background. My Ph.D. is in the area of solid-state physics, which deals with the working of semiconductors. Earlier, I used to make chips that go into cell phones and cell towers. It was mainly concerned with technology and making cellular communications happen. That math-heavy background helped me get into data science.

Q3 - What does it take to be a data scientist?

Data science requires a lot of imagination and once you have the imagination the next thing you need is the math skills. The engineering and analysis habit that you develop comes in very handy when you do the data analysis. It's just that the scale is bigger.

Q4 - How do you stay on the cutting edge?

We read a lot; data scientists read a lot. We have to, because a new tool comes out every day. There is always a new technique to apply to a certain problem, a technique that has not been used before. I compete in data science competitions, global competitions. Through these competitions, I interact a lot with people around the world who are doing data science. They serve as a platform for understanding how data science applies to different problems. Right now, I am working on a project where we are improving the customer's digital experience. We look at customer care data and see where the issues are. This helps us fix the issues quickly and improve customer satisfaction.

Q5 - What tools do you use?

R is one of the basic tools that we use. Other than that, we use Python a lot. Depending on the situation, we sometimes also use the visualization tools out there in the market. But with R and Python, we can pretty much take care of everything we need.

Q6 - Advice for aspiring data scientists?

It depends on the background the person has and where they are trying to go. They can move into data science, data development or the data-wrangling side. For data science, as I said, creativity is most important. Solve more problems on computational platforms, try to get internships and work as much as possible. You also need a math background. Other than that, be familiar with tools such as R and Python. These tools will help you play with data.

Key Takeaways

So this was a peek into the life of an AT&T data scientist. Hope you enjoyed it! Here are a few key takeaways -

  • Data science needs a lot of creativity and imagination.
  • A math-heavy background comes in very handy.
  • Keep reading and competing to stay on the cutting edge.
  • Learn R and Python - they cover most of what you need.

The above four should definitely help you get on track, especially the last one, 😉. Do comment and let me know what you think.

Friday, September 22

Data Scientist Skill - Performing Linear Regression With Stata

Hi guys! Welcome to the Data Scientist Skills blog. Today, I am going to share a basic statistics skill that data scientists use frequently: linear regression. Stata is used to perform the regression, and the results are interpreted accordingly. A built-in dataset is used for the analysis.

Data Scientist Skill: Linear Regression With Stata
Technique: Linear Regression 
Tool: Stata

What is Stata?

Stata is a simple-to-use statistical software package that helps in performing analytical operations. The tool was developed by StataCorp in 1985 and has been revised fifteen times since then. A few standard operations for which Stata is used are basic tabulations and summaries, time-series smoothers, contrasts and comparisons, power analysis, linear regression, generalized linear models (GLM), cluster analysis, case-control analysis, ARIMA, ANOVA and MANOVA. Here, I have used Stata to perform linear regression using a built-in dataset.

Linear Regression

Linear regression is a mathematical approach used to understand the linear relationship between a result variable, say Y, and the affecting variable(s), say X1, X2, X3.....Xn. Here, Y is called the dependent variable and X1, X2, X3.....Xn are called independent variables. A regression line is drawn when the relationship is observed between Y and a single variable X. In this post, the regression is performed using only one independent variable. Hence, the equation will be

Y = b1 + (b2)X

Here, the constants b1 and b2 are called the estimators of the linear regression model and are derived from the variable values in the dataset.
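For a single X, the estimators have closed forms: b2 is the covariance of X and Y divided by the variance of X, and b1 is the mean of Y minus b2 times the mean of X. A small sketch with made-up sample values:

```python
import numpy as np

# Simple linear regression estimators, mirroring Y = b1 + b2*X
x = np.array([1.0, 2, 3, 4, 5])
y = np.array([3.9, 4.4, 4.5, 5.1, 5.6])  # made-up sample values

# Population (bias=True) covariance and variance cancel consistently
b2 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b1 = y.mean() - b2 * x.mean()
print(b1, b2)
```

The same numbers would appear in the _cons and slope rows of Stata's regression output for this toy data.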

Linear Regression in Stata 

Let's begin with the regression model. Here I have used the nlsw88 dataset, which captures the National Longitudinal Survey of Women data. It describes different employment parameters associated with working women in the 1980s. Below are the steps you can follow to load the dataset into Stata.

Step 1 - Open Stata. Go to the File menu and click on Example datasets.

Example datasets

Step 2 - Select the "Example datasets installed with Stata" option and choose the nlsw88.dta file by clicking on "use".

Example datasets installed with Stata

nlsw88 dataset

Step 3 - Browse the data and understand the variables. Select "Data" from the menu, drag over to "Data Editor" and select "Data Editor (Browse)".

Data Editor (Browse)

There are a lot of variables in the dataset that describe age, sex, marital status, graduation, city, industry, occupation, wages, tenure, hours of work and total work experience.


I am interested in knowing how the total work experience is affecting the wages of the female employees. To get a general idea, I will first plot a scatter graph between wage and total work experience (ttl_exp).

Step 4 - To plot the scatter diagram, select "Graphics" option from the menu and click on the first option "Two-way graph".

Twoway graph

Step 5 - Click on "Create"; a dialog box with multiple options will appear. Choose wage as the Y variable and ttl_exp as the X variable, keeping everything else the same. Click on "Submit".

Create Scatter Diagram in Stata

A scatter diagram showing wage on the Y axis and total work experience on the X axis will appear.

Wage vs total work experience

We can observe that the values of Y increase as X increases, which suggests an association between total work experience and wage. Let's run the linear regression model to find out more about their relationship.

Step 6 - Choose "Statistics" from the main menu, then select "Linear models and related" and click on "Linear Regression".

Linear Regression in Stata

Step 7 - A dialog box will appear. Choose wage as the dependent variable and ttl_exp as the independent variable, and click on "Submit".

Linear Regression results in Stata

You can now see the result of Linear Regression in Stata. The result is shared below.

Interpreting Linear Regression Results in Stata

Interpreting the Linear Regression Results in Stata

In the above image, we can see the coefficients b1 and b2. The coefficient of _cons is our b1 and the coefficient of ttl_exp is b2. b2 is 0.33, which tells us that one additional year of work experience is associated with a $0.33 increase in hourly wage. Do remember that this data is from the 1980s, and $0.33 was a considerable amount at that time. And b1 tells us that a woman with no work experience is predicted to earn $3.60 as an hourly wage. Hence, our equation now becomes,

Y = 3.6 + 0.33 X

where Y is the hourly wage and X is the years of total work experience.
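The fitted line can be used directly for quick predictions. A tiny sketch (the helper name is mine; the coefficients are the ones from the regression output above):

```python
def predicted_wage(years_experience):
    """Predicted hourly wage from the fitted line wage = 3.6 + 0.33 * ttl_exp."""
    return 3.6 + 0.33 * years_experience

print(predicted_wage(10))  # prediction for a woman with 10 years of experience
```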

We are now done with linear regression for a single variable. You can add more variables to get a better idea of the relationship between wage and the other attributes. Linear regression is a simple and effective way to understand the relationship between variables, which makes it an important data scientist skill.

Thursday, September 7

Data Scientist Skills - Foundation

The data science career track is currently one of the hottest in the market. Individuals from various fields are moving into data science roles. Data science is a broad domain that demands skills from several disciplines: a data scientist has to be good with math, business and technology. In this blog post, I will try to cover all the foundational skills required to be a good data scientist. If you want to become a data scientist and already have these skills, you can move right ahead; but if you feel some part of the foundation is a little weak, just work on it and it will ease your way to becoming a successful data scientist.

Data Scientist Skills

To become a proficient data scientist, four major skills are required: problem-solving skills, analytical skills, coding skills and mathematical aptitude. There are other skills as well that will enhance your ability and add to your charm, but being good at these four should definitely be your first target. Let's go through them one by one.

Problem Solving Skill

The most important skill that will help you become an expert data scientist, business analyst, data engineer or machine learning consultant is the skill of solving problems. Problems are an unavoidable part of every domain. Finance, insurance, health care, logistics, supply chain - all industries have a particular way of functioning, and understanding the irregularities in those functions makes you a good problem solver. As a data scientist, you can work in any domain; you will pick up the domain knowledge gradually with experience, but the skill of solving problems is one you need to cultivate independently of domain-specific issues.

Even in interviews, the interviewer tries to get a rough idea of your problem-solving skill by posing hypothetical or open-ended cases. Some interviewers might ask you to guess the number of ATM transactions in New York, some might ask how you would move Mount Fuji, and some might ask you to estimate the revenue earned by Walmart in a day across the US market. These problems are typically odd ones, and the answers are never exact, but they do help analyze the candidate's thinking process. Good problem-solving skill should be the first one you master if you are looking for data scientist job roles.

Mathematical Aptitude

The majority of the work a data scientist does involves rigorous use of mathematics, especially statistics, algebra, calculus and probability. As a data scientist, you will be understanding the business problem, its demands and which machine learning model best fits the use case. To understand these models and algorithms, you need to be familiar with the math behind them: the derivations, assumptions and equations. If you don't like math, then it's going to be really challenging.

Analytical Skills

Analytical skill is itself a collection of many sub-skills, but here we are mainly concerned with pattern recognition and identification. You should be able to scan the data and analyze it, identifying different patterns and relationships between the entities in the data available for analysis. Take the example of housing data with attributes such as house size, location, price and number of bedrooms. Just by going through the data, you should get a rough idea of which attributes affect the price the most: which are directly related to price and which are inversely related. Later on, you can do regression analysis and use other algorithms to identify the collinearity, variance and other relationships among the variables more rigorously. Analysis of the data helps a data scientist identify deviations and produce insights that can be leveraged for business growth and development.
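That first "scan" of the data can be as simple as checking correlations with the target. A minimal sketch on synthetic housing data (column names and effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
size = rng.uniform(800, 3500, n)   # house size in sq ft
age = rng.uniform(0, 60, n)        # age of the house in years

# Simulated price: larger houses cost more, older houses cost less
price = 50_000 + 120 * size - 800 * age + rng.normal(0, 20_000, n)

# Quick scan: which attributes move with price, and in which direction?
corr_size = np.corrcoef(size, price)[0, 1]
corr_age = np.corrcoef(age, price)[0, 1]
print(corr_size, corr_age)  # strongly positive vs. negative
```

A positive coefficient flags a directly related attribute, a negative one an inverse relationship; the regression step then quantifies these effects properly.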

Coding Skills

Languages used by data scientist

Like all other IT professionals, data scientists use the power of machines to handle big data, complex computations and visualizations. Hence, to interact with machines, you need to know the language of computers. For a data scientist, the most popular languages are R and Python; using either of the two, you can execute machine learning algorithms on a limited amount of data. If the size of the data grows beyond a certain extent, then big data technologies come into the picture: Scala, Spark, Hive, HBase and others, which are based on the Hadoop distributed file system framework. All these languages come in handy when you are working with data, so you need some coding skills to get comfortable with them. The better a coder you are, the more easily you will grasp these languages.

These four skills form the foundation of data scientist skills and will help you become a good data scientist. If you have them, congrats! You can move ahead and take the next steps. But if you lack a few, don't worry: just start working on them, and in no time you will have the strong foundation of an expert data scientist. Best of luck!