Sunday, September 24

Data Scientist Daily Task - What Do They Do?

Do you want to become a data scientist but have never had an opportunity to play that role? Have you always been curious about what a data scientist does all day? How they start their day, what kinds of projects they complete, and which techniques they use day to day? Then this post is for you.

This is a short interview with a data scientist from AT&T about his day and life as a data scientist. The video is shared by RCR Wireless News, a popular YouTube channel. In it, Karthik Rajagopalan, who completed his Ph.D. in Electrical Engineering at Arizona State University, talks about a data scientist's daily tasks and what you need to get there.


AT&T Data Scientist - Karthik Rajagopalan




Below are the interview questions and answers:

Q1 - What does a Data Scientist do?

There is a constant stream of getting data, looking at data, shaping data, and coming up with different approaches. What am I going to create today? What questions do I have, and what am I trying to answer? Which questions are relevant to get to the result? Hence, for every problem, we have a solution approach that we lay down before starting any analysis.

Q2 - What is your background?

I have a background in engineering, with degrees in mechanical, industrial and electrical engineering. Basically a math-heavy background. My Ph.D. is in the area of solid-state physics, which deals with how semiconductors work. Earlier, I used to make chips that go into cell phones and cell towers. It was mainly concerned with technology and making cellular communications happen. That math-heavy background helped me get into data science.

Q3 - What does it take to be a data scientist?

Data science requires a lot of imagination and once you have the imagination the next thing you need is the math skills. The engineering and analysis habit that you develop comes in very handy when you do the data analysis. It's just that the scale is bigger.

Q4 - How do you stay on the cutting edge?

We read a lot; data scientists read a lot. We have to, because a new tool comes out every day. There is always a new technique to apply to a certain problem, one that has not been used before. I compete in global data science competitions. Through these competitions, I interact a lot with people around the world who are doing data science. They serve as a platform to understand different applications of data science to different problems. Right now, I am working on a project where we are improving the digital experience of the customer. We look at customer care data and see where the issues are. This helps us fix the issues quickly and improve customer satisfaction.

Q5 - What tools do you use?

R is one of the basic tools that we use. Other than that, we use Python a lot. Depending on the situation, we sometimes also use the visualization tools that are out there in the market. But with R and Python, we pretty much take care of everything we need.

Q6 - Advice for aspiring data scientists?

Depends on the background the person has and where they are trying to go. They can move into data science, data development or the wrangling part of data. For data science, as I said, creativity is most important. Get to solve more problems on computational platforms, try to get internships and work as much as possible. You also need to have a math background. Other than that, be familiar with tools such as R and Python. These tools will help you play with data.

Key Takeaways

So this was a peek into the life of the AT&T data scientist. Hope you enjoyed it! Here are a few key takeaways -
The above four should definitely help you to get on the track, especially the last one, 😉. Do comment and let me know what you think.

Friday, September 22

Data Scientist Skill - Performing Linear Regression With Stata

Hi guys! Welcome to the Data Scientist Skills blog. Today, I am going to share a basic statistics skill that data scientists use frequently: linear regression. I use Stata to perform the regression and then interpret the results. A built-in data set is used for the analysis.

Data Scientist Skill: Linear Regression With Stata
Technique: Linear Regression 
Tool: Stata


What is Stata?


Stata is an easy-to-use statistical software package for performing analytical operations. It was first released by StataCorp in 1985 and had reached version 15 at the time of writing. A few standard operations for which Stata is used are basic tabulations and summaries, time-series smoothers, contrasts and comparisons, power analysis, linear regression, generalized linear models (GLM), cluster analysis, case–control analysis, ARIMA, ANOVA, and MANOVA. Here, I have used Stata to perform linear regression using a built-in data set.

Linear Regression


Linear regression is a mathematical approach used to understand the linear relationship between a result variable, say Y, and one or more affecting variables, say X1, X2, ..., Xn. Here, Y is called the dependent variable and X1, X2, ..., Xn are called independent variables. When there is only a single independent variable X, the relationship can be drawn as a straight regression line. In this post, the regression is performed using only one independent variable. Hence, the equation will be

Y = b1 + (b2)X
Here, the constants b1 (the intercept) and b2 (the slope) are the coefficients of the linear regression model. They are estimated from the variable values present in the dataset.
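To make the estimation step concrete, here is a minimal Python sketch of how b1 and b2 can be computed by ordinary least squares for a single independent variable. This is only an illustration, not what Stata does internally; the function name and the sample points are made up for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b1, b2) for the least-squares line Y = b1 + b2 * X."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b2 = sxy / sxx               # slope
    b1 = mean_y - b2 * mean_x    # intercept
    return b1, b2


# Made-up illustrative points that lie exactly on Y = 1 + 2X
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b1, b2 = fit_simple_linear_regression(xs, ys)
print(f"Y = {b1:.2f} + {b2:.2f} X")
```

Because the sample points sit exactly on a line, the sketch recovers that line's intercept and slope. With real data the points scatter around the line, and b1 and b2 describe the best-fitting line in the least-squares sense.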


Linear Regression in Stata 


Let's begin with the regression model. Here I have used the nlsw88 dataset, an extract of the National Longitudinal Survey of Women. It describes various employment characteristics of working women in the 1980s. Below are the steps you can follow to load the dataset into Stata.

Step 1 - Open Stata. Go to File option and click on Example datasets.

Example datasets

Step 2 - Select the "Example datasets installed with Stata" option and choose the nlsw88.dta file by clicking on "use".

Example datasets installed with Stata

nlsw88 dataset

Step 3 - Browse the data and understand the variables. Select "Data" from the menu, drag over to "Data Editor" and select "Data Editor (Browse)".

Data Editor (Browse)


There are a lot of variables in the dataset, describing age, race, marital status, graduation, city, industry, occupation, wages, tenure, hours of work and total work experience.

Variables


I am interested in knowing how the total work experience is affecting the wages of the female employees. To get a general idea, I will first plot a scatter graph between wage and total work experience (ttl_exp).

Step 4 - To plot the scatter diagram, select the "Graphics" option from the menu and click on the first option, "Twoway graph (scatter, line, etc.)".

Twoway graph

Step 5 - Click on "Create". A dialog box with multiple options will appear. Choose wage as the Y variable and ttl_exp as the X variable, and leave everything else unchanged. Click on "Submit".

Create Scatter Diagram in Stata

A scatter diagram showing wage on the Y-axis against total work experience on the X-axis will appear.

Wage vs total work experience

We can observe that the values of Y tend to increase as X increases. This suggests that there is a positive association between total work experience and wage. Let's run the linear regression model to find out more about their relationship.

Step 6 - Choose "Statistics" from the main menu, then select "Linear models and related" and click on "Linear Regression".

Linear Regression in Stata

Step 7 - A dialog box will appear. Choose "wage" as the dependent variable and "ttl_exp" as the independent variable, then click on "Submit".

Linear Regression results in Stata

You can now see the result of Linear Regression in Stata. The result is shared below.

Interpreting Linear Regression Results in Stata

Interpreting the Linear Regression Results in Stata


In the above image, we can see the coefficients b1 and b2. The coefficient of _cons is our b1 and the coefficient of ttl_exp is b2. b2 is 0.33, which tells us that one additional year of total work experience is associated with a $0.33 increase in hourly wage. Do remember that this data is from the 1980s, when $0.33 was a considerable amount. And b1 tells us that a woman with no work experience is predicted to earn about $3.60 as an hourly wage. Hence, our equation now becomes,

Y = 3.6 + 0.33 X

where Y is the hourly wage and X is the years of total work experience.
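As a quick illustration, the fitted equation can be used to predict a wage for a given number of years of total work experience (ttl_exp, the independent variable chosen in Step 7). The Python sketch below simply plugs values into the equation; the constants are the rounded coefficients quoted in this post, and the function name is my own.

```python
# Rounded coefficients as quoted in the post (assumed, not re-estimated here)
B1 = 3.6    # intercept: coefficient of _cons
B2 = 0.33   # slope: coefficient of ttl_exp


def predicted_hourly_wage(years_of_experience):
    """Predicted hourly wage in dollars from Y = 3.6 + 0.33 X."""
    return B1 + B2 * years_of_experience


for years in (0, 5, 10):
    wage = predicted_hourly_wage(years)
    print(f"{years:>2} years of experience -> predicted wage ${wage:.2f}/hour")
```

Keep in mind that these are averages predicted by a one-variable model, so an individual's actual wage can differ a lot from the predicted value.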

We are now done with the linear regression for a single variable. You can add more independent variables to get a better idea of the relationship between the wage and other variables. Linear regression is a simple and effective way to understand the relationship between variables, which makes it an important data scientist skill.

Thursday, September 7

Data Scientist Skills - Foundation

Data science is currently one of the most popular career tracks in the market. Individuals from various fields are moving to data science roles. Data science is a broad domain that demands skills from several areas. A data scientist has to be good with maths, business, and technology. In this blog post, I will try to cover all the foundational skills required to be a good data scientist. If you want to become a data scientist and already have these skills, you can move right ahead. But if you feel some part of the foundation is a little weak, just work on it and it will ease your way to becoming a successful data scientist.


Data Scientist Skills


To become a proficient data scientist, four major skills are required: problem-solving skills, mathematical aptitude, analytical skills and coding skills. There are other skills that will enhance your ability and add to your charm, but being good at these four should definitely be your first target. Let's go through them one by one.

Problem Solving Skill


The most important skill that will help you become an expert data scientist, business analyst, data engineer or machine learning consultant is the skill of solving problems. Problems are an unavoidable part of every domain. Finance, insurance, health care, logistics, supply chain: every industry has its own way of functioning, and understanding where those functions break down makes you a good problem solver. As a data scientist, you can work in any domain. You will pick up the domain knowledge gradually with experience, but problem solving is the skill you need to cultivate independently of any domain-specific issues.

Even in interviews, the interviewer tries to get a rough idea of your problem-solving skill by posing hypothetical or unexpected cases. One interviewer might ask you to guess the number of ATM transactions in New York, another might ask how you would move Mount Fuji, and another might ask for the revenue Walmart earns in a day across the US market. These problems are deliberately odd, and the answers to them are never exact. But they do help the interviewer analyze the candidate's thinking process. Good problem-solving skill should be the first thing you master if you are looking for data scientist job roles.

Mathematical Aptitude


The majority of the work a data scientist does involves rigorous use of mathematics, especially statistics, algebra, calculus, and probability. As a data scientist, you will need to understand the business problem, its demands and which machine learning model is the best fit for the use case. To understand these models and algorithms, you need to be familiar with the maths behind them: the derivations, assumptions, and equations. If you don't like maths, then it's going to be really challenging.

Analytical Skills


Analytical skill is itself a collection of many sub-skills. But here, we are mainly concerned with pattern recognition and identification. You should be able to scan the data and analyze it. You should be able to identify different patterns and relationships between entities in the data available for analysis. Let's take the example of a housing dataset that has attributes such as house size, location, price, number of bedrooms etc. Just by going through the data, you should get a rough idea of which attributes affect the price the most, which attributes move in the same direction as the price and which move in the opposite direction. Later on, you can do regression analysis and use other algorithms to rigorously identify the collinearity, variance and other relationships among the variables. Analysis of the data helps a data scientist identify deviations in the data and produce insights that can be leveraged for business growth and development.
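One simple way to back up that first rough idea is to compute the correlation of each attribute with the price. Below is a small Python sketch of this. The housing records, the attribute names and the helper function are all hypothetical and made up purely for illustration.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical, made-up housing records (not real data)
houses = {
    "size_sqft": [800, 1000, 1200, 1500, 2000],
    "bedrooms": [2, 2, 3, 3, 4],
    "distance_to_center_km": [12, 9, 10, 5, 3],
}
price_thousands = [150, 190, 210, 280, 360]

correlations = {
    name: pearson_correlation(values, price_thousands)
    for name, values in houses.items()
}

# Strongest relationships first, regardless of direction
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    direction = "rises with price" if r > 0 else "falls as price rises"
    print(f"{name:<24} r = {r:+.2f} ({direction})")
```

A positive value means the attribute tends to rise with the price, and a negative value means it tends to fall as the price rises. Correlation only captures linear association, so it is a starting point for the analysis rather than a conclusion.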

Coding Skills


Languages used by data scientist

Like all other IT people, data scientists use the power of machines to handle big data, complex computations, and visualizations. Hence, to interact with machines, you need to know the language of computers. For a data scientist, the most popular languages are R and Python. Using either of the two, you can run most machine learning algorithms on a limited amount of data. If the size of the data grows beyond a certain extent, then big data technologies come into the picture. These include Spark (often used with Scala), Hive and HBase, which are part of the Hadoop ecosystem. All these languages and tools come in handy when you are working with data. Hence, you need some coding skills to get comfortable with them. The better a coder you are, the more easily you will be able to grasp them.

These four skills are the foundational data scientist skills that will help you become a good data scientist. If you have all four, congrats! You can move ahead and take the next steps. But if you lack a few, don't worry. Just start working on them and in no time you will have a strong foundation of expert data scientist skills. Best of luck!