Data Scientist Skills (blog feed, last updated 2018-03-06)<br /><br /><h2>Data Scientist Skills - Which Regression Model to Use?</h2>Nitesh Choudhary, published 2017-10-14<br /><br />Regression is one of the simplest machine learning algorithms, and it helps you understand the relationship between the predictors and the dependent variable. A linear regression explains how the predicted value (say Y) depends on each individual attribute (say X1, X2, ..., Xn).<br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-VptyhRy3XZE/WeKHBOEfLxI/AAAAAAAAAdY/7wqjen0ZMNwE3t_VVTJ8KRDkKCFb0NTZACLcBGAs/s1600/Linear%2BRegression%2BOuput%2BStata.PNG" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="Linear Regression Output in Stata" border="0" data-original-height="335" data-original-width="639" height="334" src="https://3.bp.blogspot.com/-VptyhRy3XZE/WeKHBOEfLxI/AAAAAAAAAdY/7wqjen0ZMNwE3t_VVTJ8KRDkKCFb0NTZACLcBGAs/s640/Linear%2BRegression%2BOuput%2BStata.PNG" title="Linear Regression Output in Stata" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Linear Regression Output in Stata</td></tr></tbody></table><br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-rea6PoECyOc/WeKHBEhbxmI/AAAAAAAAAdc/LAsKd_C6MDEVVAvgZ4kBw7EIeVFTZbMNwCEwYBhgL/s1600/Linear%2BRegression%2BLine%2BStata.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Regression Graph in Stata" border="0" 
data-original-height="559" data-original-width="763" height="292" src="https://1.bp.blogspot.com/-rea6PoECyOc/WeKHBEhbxmI/AAAAAAAAAdc/LAsKd_C6MDEVVAvgZ4kBw7EIeVFTZbMNwCEwYBhgL/s400/Linear%2BRegression%2BLine%2BStata.PNG" title="Regression Graph in Stata" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Regression Graph in Stata</td></tr></tbody></table><br />The above images show a linear regression performed in Stata to understand how the price of a house varies with the different features of the house. The second image shows the regression line for the price of the house with respect to its living area.<br /><br />In this post, I will perform multiple linear regression to observe how the price of a house varies with different attributes: living area in square feet (livarea), number of bedrooms (beds), age of the house (age), number of bathrooms (baths), and whether the house has a pool (pool). The variable pool is a categorical variable that is 1 if the house has a pool and 0 if it does not. Refer to this post to learn how to <a href="http://www.datascientistskills.com/2017/09/performing-linear-regression-stata.html">perform linear regression with Stata</a>.<br /><br />The dataset is publicly available and contains data for 1500 houses sold in Stockton, CA between 1996 and 1998. You can download the Excel file from here - <a href="http://www.principlesofeconometrics.com/poe4/data/excel/stockton4.xlsx">Stockton4.xlsx</a>.<br /><br /><h3>Regression Models</h3><div><br /></div>I will run three models by transforming these variables, and then we can compare them to find which one is the best explanatory model. 
Let's see the results.<br /><br />Model 1 - Includes square terms.<br />Model 2 - Includes interaction terms.<br />Model 3 - Includes both square and interaction terms.<br /><br /><b>Model 1</b><br /><br />Here are the results of multiple linear regression with square terms. <i>Note that livarea*livarea is written as c.livarea#c.livarea in Stata</i>. In addition to the base variables, Model 1 includes the squares of living area, age of the house, and number of bedrooms.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/--WmqJeAFX4s/WeKxESpOYwI/AAAAAAAAAd8/PyG8lCqL6ToKZKvzcZhzUQk17pTHQ2v-QCLcBGAs/s1600/Regression%2BModel%2B1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="753" data-original-width="765" height="628" src="https://4.bp.blogspot.com/--WmqJeAFX4s/WeKxESpOYwI/AAAAAAAAAd8/PyG8lCqL6ToKZKvzcZhzUQk17pTHQ2v-QCLcBGAs/s640/Regression%2BModel%2B1.PNG" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><br /><b>Model 2</b><br /><br />Below is the result from Model 2. 
In this model, we have the interaction terms livarea*age, livarea*beds, and age*beds along with the base variables.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-Zl0Q2EYgvzg/WeKxxfegp7I/AAAAAAAAAeE/INCOn02fYnsipAOLZjOVNJQHgFUlGLfbgCLcBGAs/s1600/Regression%2BModel%2B2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="716" data-original-width="783" height="584" src="https://3.bp.blogspot.com/-Zl0Q2EYgvzg/WeKxxfegp7I/AAAAAAAAAeE/INCOn02fYnsipAOLZjOVNJQHgFUlGLfbgCLcBGAs/s640/Regression%2BModel%2B2.PNG" width="640" /></a></div><br /><b>Model 3</b><br /><b><br /></b>Model 3 has all the variables, including the square terms and the interaction terms.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-YgoV-vrxkIM/WeK1__BXlbI/AAAAAAAAAeU/NKwWCi5uFREglVtqw8mR7iitRzJHBDlmgCLcBGAs/s1600/Regression%2BModel%2B3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="869" data-original-width="1129" height="492" src="https://3.bp.blogspot.com/-YgoV-vrxkIM/WeK1__BXlbI/AAAAAAAAAeU/NKwWCi5uFREglVtqw8mR7iitRzJHBDlmgCLcBGAs/s640/Regression%2BModel%2B3.PNG" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><br /><h3><br />Which Regression Model to Use?</h3><div>Now we have these three models. How do we decide which one to use? To choose a regression model, we look at three things:</div><div><br /></div><div><ul><li>Adjusted R squared</li><li>Akaike Information Criterion (AIC)</li><li>Bayesian Information Criterion (BIC)</li></ul><div><b>Adjusted R squared</b></div></div><div><br /></div><div>It is similar to R squared, which measures how much of the variation in the dependent variable the model explains, but its value decreases when irrelevant variables are added. Hence, the higher the adjusted R squared value, the better the regression model. 
Below is the formula for adjusted R squared.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-AJWzTZNKOvo/WeK3BylKycI/AAAAAAAAAeo/BoAquGiBoiUl6dRZSH0WM2KUoszA1aPqwCLcBGAs/s1600/Adjusted%2Br%2Bsquared.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="215" data-original-width="313" src="https://1.bp.blogspot.com/-AJWzTZNKOvo/WeK3BylKycI/AAAAAAAAAeo/BoAquGiBoiUl6dRZSH0WM2KUoszA1aPqwCLcBGAs/s1600/Adjusted%2Br%2Bsquared.png" /></a></div><div><br /></div>A higher adjusted R squared indicates a model whose variables are relevant.<div><b><br /></b></div><div><b>Akaike Information Criterion (AIC)</b></div><div><br /></div><div>It measures how well the model fits relative to the number of variables included in the model. The lower the value of AIC, the better. Below is the formula for AIC.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-7eBkhNOugtg/WeK58kPzDkI/AAAAAAAAAe4/djgJ3n4CAcAjlkmcIzMypNSAEmKXawMTgCLcBGAs/s1600/AIC_formula.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="67" data-original-width="262" height="101" src="https://1.bp.blogspot.com/-7eBkhNOugtg/WeK58kPzDkI/AAAAAAAAAe4/djgJ3n4CAcAjlkmcIzMypNSAEmKXawMTgCLcBGAs/s400/AIC_formula.PNG" width="400" /></a></div><div> </div><div><br /></div><div>SSE = Sum of squared errors</div><div>N = Number of observations</div><div>K = Number of estimated coefficients in the model (including the constant); the degrees of freedom are N - K</div><div><br /></div><div><b>Bayesian Information Criterion</b></div><div><br /></div><div>The Bayesian Information Criterion (BIC), also known as the Schwarz Criterion (SC), is similar to AIC but applies a heavier penalty, raising the BIC value more when a less significant variable is added to the model. 
Hence, the lower the BIC value, the better the model.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-TmNod2paVe4/WeK6jVyrr1I/AAAAAAAAAfA/gDLEgSwuYTEQnySNhDHSDdPejIiWmuVMQCLcBGAs/s1600/BIC_formula.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="86" data-original-width="295" height="116" src="https://3.bp.blogspot.com/-TmNod2paVe4/WeK6jVyrr1I/AAAAAAAAAfA/gDLEgSwuYTEQnySNhDHSDdPejIiWmuVMQCLcBGAs/s400/BIC_formula.PNG" width="400" /></a></div><div>When there are more than 8 observations, BIC penalizes each added variable more heavily than AIC does, and it is the most important criterion to consider. If adjusted R squared and AIC favor model A but BIC is lower for model B, you should conclude that model B is better and choose it.</div><div><br /></div><div>Now if we go back to our results, we see that Model 3 has the highest adjusted R squared value, the lowest AIC, and the lowest BIC. Hence, without a second thought, we conclude that regression Model 3 should be used.</div><div><br /></div><div>Note that Model 3 is not necessarily the best of all possible models. We can transform the variables further to build more models and check their BIC values. But among the three models discussed above, Model 3 is the best to use for the reasons given. </div><div> </div><h2>Data Scientist Daily Task - What Do They Do?</h2>Nitesh Choudhary, published 2017-09-24, updated 2017-10-14<br /><br />Do you want to become a data scientist but have never had an opportunity to play that role? Have you always been curious about what a data scientist does all day? How they start their day, what kind of projects they complete, or which techniques they use day to day? 
Then this post is for you.<br /><br />This is a short interview with a <b>data scientist from AT&T</b> about his day and life as a data scientist. The video is shared by <a href="https://www.youtube.com/channel/UCvQ5AF7ldp99HZX3iyFuD-Q" rel="nofollow" target="_blank">RCR Wireless News</a>, a fairly popular YouTube channel. In this video, <a href="https://www.linkedin.com/in/krajagopalan/" target="_blank"><b>Karthik Rajagopalan</b></a>, who completed his Ph.D. in Electrical Engineering at Arizona State University, talks about the <b>data scientist daily tasks</b> and what you need to get there.<br /><h4><br />AT&T Data Scientist - Karthik Rajagopalan</h4><br /><iframe allowfullscreen="" frameborder="0" height="320" src="https://www.youtube.com/embed/EaptTxhh6sM" width="580"></iframe><br /><br />Below are the interview questions and answers:<br /><br />Q1 - <b>What does a Data Scientist do?</b><br /><br />There is a constant stream of getting data, looking at data, shaping data, and coming up with different approaches. What am I going to create today? What sort of questions do I have, and what am I trying to answer? Which questions are relevant to get to the result? Hence, for every problem, we have a solution approach that we lay down before starting any analysis.<br /><br />Q2 - <b>What is your background?</b><br /><br />I have a background in engineering, with degrees in mechanical, industrial, and electrical engineering. Basically, a math-heavy background. My Ph.D. is in the area of solid-state physics, which deals with the working of semiconductors. Earlier, I used to make chips that go into cell phones and cell towers. It was mainly concerned with technology and making cellular communications happen. 
I was brought up in a math-heavy background, and that helped me get into data science.<br /><br />Q3 - <b>What does it take to be a data scientist?</b><br /><br />Data science requires a lot of imagination, and once you have the imagination, the next thing you need is math skills. The engineering and analysis habits that you develop come in very handy when you do data analysis. It's just that the scale is bigger.<br /><br />Q4 - <b>How do you stay on the cutting edge?</b><br /><br />We read a lot; data scientists read a lot. We have to, because a new tool comes out every day. There is always a new technique to apply to a certain problem, one that has not been used before. I compete in global data science competitions. Through these competitions, I interact a lot with people around the world who are doing data science. They serve as a platform for understanding different applications of data science to different problems. Right now, I am working on a project where we are improving the digital experience of the customer. We look at customer care data to see where the issues are. This helps us fix the issues quickly and improve customer satisfaction.<br /><br />Q5 - <b>What tools do you use?</b><br /><br />R is one of the basic tools that we use. Other than that, we use Python a lot. Depending on the situation, we sometimes also use the visualization tools that are available in the market. But with R and Python, we pretty much take care of everything we need.<br /><br />Q6 - <b>Advice for aspiring data scientists?</b><br /><br />It depends on the background the person has and where they are trying to go. They can move into data science, <a href="https://www.ibm.com/support/knowledgecenter/en/SSRTLW_8.5.1/com.ibm.datatools.project.dev.doc/topics/tdevprojcreate.html" target="_blank">data development</a> or the wrangling part of data. For data science, as I said, creativity is most important. 
Solve more problems on computational platforms, try to get internships, and work as much as possible. You also need to have a math background. Other than that, be familiar with tools such as R and Python. These tools will help you play with data.<br /><br /><h4>Key Takeaways</h4><div align:left=""><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-KANNSspuyj8/Wcho4n0PnTI/AAAAAAAAAcI/madpBs34uxw5eaEXsGQ5v7V4aFy8h9r8gCLcBGAs/s1600/data-scientist-skills.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="234" data-original-width="571" src="https://2.bp.blogspot.com/-KANNSspuyj8/Wcho4n0PnTI/AAAAAAAAAcI/madpBs34uxw5eaEXsGQ5v7V4aFy8h9r8gCLcBGAs/s1600/data-scientist-skills.jpg" /></a></div></div><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />So this was a peek into the life of an AT&T data scientist. Hope you enjoyed it! Here are a few key takeaways -<br /><ul><li>Improve your math, as much as you can.</li><li><a href="https://www.datacamp.com/" target="_blank">Practice R and Python</a>, as much as you can.</li><li>Take part in competitions, as much as you can.</li><li>Subscribe to <a href="http://www.datascientistskills.com/" target="_blank">Data Scientist Skills</a> (DSS)</li></ul>The above four should definitely help you get on track, especially the last one 😉. Do comment and let me know what you think.<br /><br /><h2>Data Scientist Skill - Performing Linear Regression With Stata</h2>Nitesh Choudhary, published 2017-09-22<br /><br />Hi guys! Welcome to the <a href="http://www.datascientistskills.com/" target="_blank">Data Scientist Skills blog</a>. Today, I am going to share a basic statistics skill that data scientists use frequently. 
It is the most basic regression technique: linear regression. Stata is used to perform the regression, and the results are interpreted accordingly. A built-in dataset is used for the analysis.<br /><div><br /></div><div><b>Data Scientist Skill</b>: Linear Regression With Stata</div><div><b>Technique</b>: Linear Regression </div><div><b>Tool</b>: Stata</div><h4><br />What is Stata?</h4><div><br /></div><div><a href="https://www.stata.com/" target="_blank">Stata</a> is an easy-to-use statistical software package for performing analytical operations. The tool was developed by StataCorp in 1985 and has been revised fifteen times since then. A few standard operations for which Stata is used are basic tabulations and summaries, time-series smoothers, contrasts and comparisons, power analysis, linear regression, generalized linear models (GLM), cluster analysis, case–control analysis, ARIMA, ANOVA, and MANOVA. Here, I have used Stata to perform linear regression using a built-in dataset.</div><div><br /></div><h4>Linear Regression</h4><div><br /></div><div>Linear regression is a mathematical approach used to understand the linear relationship between a result variable, say Y, and the affecting variable(s), say X1, X2, X3, ..., Xn. Here, Y is called the dependent variable and X1, X2, X3, ..., Xn are called independent variables. When Y is related to a single variable X, the relationship can be drawn as a regression line. In this post, the regression is performed using only one independent variable. Hence, the equation will be</div><div><br /></div><blockquote class="tr_bq" style="text-align: center;"><b>Y</b> = b1 + (b2)<b>X</b></blockquote> Here, the constants b1 and b2 are called the estimators of the linear regression model and are derived from the variable values present in the dataset.<br /><h4><br />Linear Regression in Stata </h4><br />Let's begin with the regression model. 
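<br /><br />As a rough illustration of how the estimators b1 and b2 in Y = b1 + (b2)X can be derived from data, here is a minimal Python sketch of the closed-form least-squares formulas for a single independent variable. The function name fit_simple_ols and the toy numbers are assumptions made for illustration only; they do not come from the Stata session in this post.<br /><br />

```python
def fit_simple_ols(x, y):
    """Return (b1, b2) for the line Y = b1 + b2 * X using ordinary least squares."""
    n = len(x)
    if len(y) != n or n in (0, 1):
        raise ValueError("x and y must have the same length, with at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    b2 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b1 = mean_y - b2 * mean_x
    return b1, b2


# Toy data (made up for illustration) that lies exactly on Y = 1 + 2X.
b1, b2 = fit_simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
print(b1, b2)
```

Because the toy points lie exactly on a line, the sketch recovers an intercept of 1 and a slope of 2. With real data the points scatter around the line, and the same formulas give the best-fitting intercept and slope.<br /><br />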
Here I have used the nlsw88 dataset, which captures data from the National Longitudinal Survey of Women. It describes various employment characteristics of working women in the 1980s. Below are the steps you can follow to load the dataset into Stata.<br /><br />Step 1 - Open Stata. Go to the File menu and click on Example datasets.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-c7rYWjJWA2w/WcSyjhiU2uI/AAAAAAAAAZM/ICLxrOm0afI7QyCGiInD0POu401bYm2WwCLcBGAs/s1600/Example%2Bdataset%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Example datasets" border="0" data-original-height="829" data-original-width="1223" height="270" src="https://4.bp.blogspot.com/-c7rYWjJWA2w/WcSyjhiU2uI/AAAAAAAAAZM/ICLxrOm0afI7QyCGiInD0POu401bYm2WwCLcBGAs/s400/Example%2Bdataset%2BStata.PNG" title="Example datasets" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Step 2 - Select the "Example datasets installed with Stata" option and load the nlsw88.dta file by clicking on "use".</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-z14eoBcNSPc/WcS0Y55QtXI/AAAAAAAAAZc/zNWjy39yhqUKsHMlAWGXAbo1-8u12uztwCLcBGAs/s1600/Select%2Bdataset%2Binstalled%2Bwith%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Example datasets installed with Stata" border="0" data-original-height="836" data-original-width="1229" height="271" src="https://4.bp.blogspot.com/-z14eoBcNSPc/WcS0Y55QtXI/AAAAAAAAAZc/zNWjy39yhqUKsHMlAWGXAbo1-8u12uztwCLcBGAs/s400/Select%2Bdataset%2Binstalled%2Bwith%2BStata.PNG" title="Example datasets installed with Stata" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div 
class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-Sj8rcLdkCcM/WcS0Y-m8qpI/AAAAAAAAAZY/yIGlIieDIv8Y7M1iFxH7a8EL7ae2aUybwCEwYBhgL/s1600/nlsw88%2Bdataset%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="nlsw88 dataset" border="0" data-original-height="837" data-original-width="1227" height="272" src="https://3.bp.blogspot.com/-Sj8rcLdkCcM/WcS0Y-m8qpI/AAAAAAAAAZY/yIGlIieDIv8Y7M1iFxH7a8EL7ae2aUybwCEwYBhgL/s400/nlsw88%2Bdataset%2BStata.PNG" title="nlsw88 dataset" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Step 3 - Browse the data and understand the variables. Select "Data" from the menu, drag over to "Data Editor" and select "Data Editor (Browse)".</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0bGSCib-bYk/WcS2oLnPh0I/AAAAAAAAAZ4/2rFh9ZAfomMfI0VoAGAkKJkCC71xO7VnwCLcBGAs/s1600/Data%2BEditor%2B%2528Browse%2529.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Data Editor (Browse)" border="0" data-original-height="835" data-original-width="1227" height="271" src="https://1.bp.blogspot.com/-0bGSCib-bYk/WcS2oLnPh0I/AAAAAAAAAZ4/2rFh9ZAfomMfI0VoAGAkKJkCC71xO7VnwCLcBGAs/s400/Data%2BEditor%2B%2528Browse%2529.PNG" title="Data Editor (Browse)" width="400" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div>There are a lot of variables in the dataset that describe age, sex, marital status, graduation, city, industry, occupation, wages, tenure, hours of work and total work experience.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a 
href="https://2.bp.blogspot.com/-LquqOqaLpTM/WcS9XLx4orI/AAAAAAAAAaQ/sMuWFN1Cu8QrThQBMEUVtXXIbv_833q5gCLcBGAs/s1600/Variables.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Variables" border="0" data-original-height="857" data-original-width="1233" height="277" src="https://2.bp.blogspot.com/-LquqOqaLpTM/WcS9XLx4orI/AAAAAAAAAaQ/sMuWFN1Cu8QrThQBMEUVtXXIbv_833q5gCLcBGAs/s400/Variables.PNG" title="Variables" width="400" /></a></div><br /><br />I am interested in how total work experience affects the wages of the women in the sample. To get a general idea, I will first draw a scatter plot of wage against total work experience (ttl_exp).<br /><br />Step 4 - To plot the scatter diagram, select the "Graphics" option from the menu and click on the first option, "Two-way graph".<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-ja_V5vQ_yEs/WcS-dEJPokI/AAAAAAAAAac/BNI36Bcxv7Q3aV_iYjNoCJjnZfpV3JYUgCLcBGAs/s1600/Twoway%2Bgraph.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Twoway graph" border="0" data-original-height="836" data-original-width="1236" height="270" src="https://1.bp.blogspot.com/-ja_V5vQ_yEs/WcS-dEJPokI/AAAAAAAAAac/BNI36Bcxv7Q3aV_iYjNoCJjnZfpV3JYUgCLcBGAs/s400/Twoway%2Bgraph.PNG" title="Twoway graph" width="400" /></a></div><br />Step 5 - Click on "Create"; a dialog box with multiple options will appear. Choose wage as the Y variable and ttl_exp as the X variable, and keep the other settings unchanged. 
Click on "Submit".<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-BI074wndvVI/WcS_jRjwigI/AAAAAAAAAaw/7OFjiYfK2KEb3VnuqAVrc4eiXfMiAoMEQCLcBGAs/s1600/Scatter%2BDiagram%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Create Scatter Diagram in Stata" border="0" data-original-height="855" data-original-width="1226" height="278" src="https://2.bp.blogspot.com/-BI074wndvVI/WcS_jRjwigI/AAAAAAAAAaw/7OFjiYfK2KEb3VnuqAVrc4eiXfMiAoMEQCLcBGAs/s400/Scatter%2BDiagram%2BStata.PNG" title="Create Scatter Diagram in Stata" width="400" /></a></div><br />A scatter diagram will appear, with wage on the Y-axis and total work experience on the X-axis.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-lBvYnGxSEdY/WcS_-2KNmuI/AAAAAAAAAa0/23BRY9UOGbgI1fq7i7uKCo8SiA6Aa1X0ACLcBGAs/s1600/total%2Bwork%2Bexperience%2Bvs%2Bwage.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Wage vs total work experience" border="0" data-original-height="622" data-original-width="726" height="342" src="https://2.bp.blogspot.com/-lBvYnGxSEdY/WcS_-2KNmuI/AAAAAAAAAa0/23BRY9UOGbgI1fq7i7uKCo8SiA6Aa1X0ACLcBGAs/s400/total%2Bwork%2Bexperience%2Bvs%2Bwage.PNG" title="Wage vs total work experience" width="400" /></a></div><br />We can observe that the values of Y tend to increase as X increases. This suggests that there is an association between total work experience and wage. 
Let's run the linear regression model to find out more about their relationship.<br /><br />Step 6 - Choose "Statistics" from the main menu, then select "Linear models and related" and click on "Linear Regression".<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-bQ88FLSyJsI/WcTBS9eV1kI/AAAAAAAAAbA/cb9TPvrJUt01IWaN9ZbPIzp43g8-fe3OgCLcBGAs/s1600/Linear%2BRegression%2Bin%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Linear Regression in Stata" border="0" data-original-height="838" data-original-width="1227" height="272" src="https://4.bp.blogspot.com/-bQ88FLSyJsI/WcTBS9eV1kI/AAAAAAAAAbA/cb9TPvrJUt01IWaN9ZbPIzp43g8-fe3OgCLcBGAs/s400/Linear%2BRegression%2Bin%2BStata.PNG" title="Linear Regression in Stata" width="400" /></a></div><br />Step 7 - A dialog box will appear. Choose "wage" as dependent variable and "ttl_exp" as the independent variable. And click on "Submit".<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-hexLabZoxeo/WcTB0U9xY9I/AAAAAAAAAbI/28uL9OlWRp8HiLAPXnOK10gE-RyCp9uaQCLcBGAs/s1600/Linear%2BRegression%2Bresults%2Bin%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Linear Regression results in Stata" border="0" data-original-height="832" data-original-width="1232" height="270" src="https://2.bp.blogspot.com/-hexLabZoxeo/WcTB0U9xY9I/AAAAAAAAAbI/28uL9OlWRp8HiLAPXnOK10gE-RyCp9uaQCLcBGAs/s400/Linear%2BRegression%2Bresults%2Bin%2BStata.PNG" title="Linear Regression results in Stata" width="400" /></a></div><br />You can now see the result of Linear Regression in Stata. 
The result is shared below.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-cR_sQadmgYE/WcTCcyGlScI/AAAAAAAAAbY/uPW5mW3WKpQIy1ihQcILw-yiMfj_JqBJACLcBGAs/s1600/Interpreting%2BLinear%2BRegression%2BResults%2Bin%2BStata.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Interpreting Linear Regression Results in Stata" border="0" data-original-height="288" data-original-width="646" height="178" src="https://4.bp.blogspot.com/-cR_sQadmgYE/WcTCcyGlScI/AAAAAAAAAbY/uPW5mW3WKpQIy1ihQcILw-yiMfj_JqBJACLcBGAs/s400/Interpreting%2BLinear%2BRegression%2BResults%2Bin%2BStata.PNG" title="Interpreting Linear Regression Results in Stata" width="400" /></a></div><br /><h4>Interpreting the Linear Regression Results in Stata</h4><br />In the above image, we can see the coefficients b1 and b2. The coefficient of _cons is our b1 and the coefficient of ttl_exp is b2. b2 is 0.33, which tells us that one additional year of total work experience is associated with a $0.33 increase in hourly wage. Do remember that this data is from the 1980s, when $0.33 was a considerable amount. And b1 tells us that a woman with no work experience is predicted to earn $3.6 as an hourly wage. Hence, our equation now becomes,<br /><br /><div style="text-align: center;"><b>Y</b> = 3.6 + 0.33 <b>X</b></div><br />where Y is the hourly wage and X is the total work experience in years.<br /><br />We are now done with the linear regression for a single variable. You can add more variables to get a better idea of the relationship between the wage and other variables. 
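<br /><br />To make the fitted equation concrete, here is a minimal Python sketch that plugs values into Y = 3.6 + 0.33 X, where X is ttl_exp (total work experience in years). The constant names and the function predict_hourly_wage are hypothetical, and the coefficients are the rounded values quoted above, so the predictions are approximate.<br /><br />

```python
# Rounded coefficients quoted in the post from the Stata regression output.
B1_CONS = 3.6      # _cons: intercept, in dollars per hour
B2_TTL_EXP = 0.33  # ttl_exp: slope, in dollars per hour per year of experience


def predict_hourly_wage(years_experience):
    """Predicted hourly wage from the fitted line Y = 3.6 + 0.33 * X."""
    return B1_CONS + B2_TTL_EXP * years_experience


# Print predictions for a few illustrative experience levels.
for years in (0, 5, 10):
    print(years, round(predict_hourly_wage(years), 2))
```

For example, ten years of experience gives a predicted hourly wage of about 3.6 + 3.3 = $6.90. The line describes an average association in the sample, so individual wages will differ from these predictions.<br /><br />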
Linear regression is a simple and effective way to understand the relationship between variables, which makes it an important data scientist skill.<br /><br /><h2>Data Scientist Skills - Foundation</h2>Nitesh Choudhary, published 2017-09-07<br /><br /><span style="font-family: inherit;">The data science career track is one of the most popular career tracks in the market right now. Individuals from various fields are moving to data science roles. <a href="https://en.wikipedia.org/wiki/Data_science" target="_blank">Data science</a> is a broad domain that demands skills from various domains. <b>A data scientist has to be good with maths, business, and technology</b>. In this blog post, I will try to cover all the foundational skills required to be a good data scientist. If you want to become a data scientist and already have these skills, you can move right ahead, but if you feel some part of the foundation is a little weak, just work on it and it will ease your way to becoming a successful data scientist.</span><br /><span style="font-family: inherit;"><br /></span><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-WCgW9IHGUQM/WbDsbmKEleI/AAAAAAAAAXs/FeurdklDsLUeRbGrtVtoPqMKDNQ80JtpQCLcBGAs/s1600/Data%2BScientist%2BSkills.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Data Scientist Skills" border="0" data-original-height="327" data-original-width="750" height="173" src="https://4.bp.blogspot.com/-WCgW9IHGUQM/WbDsbmKEleI/AAAAAAAAAXs/FeurdklDsLUeRbGrtVtoPqMKDNQ80JtpQCLcBGAs/s400/Data%2BScientist%2BSkills.jpg" title="Data Scientist Skills" width="400" /></a></div><span style="font-family: inherit;"><br /></span> <span style="font-family: inherit;"><br /></span> <span style="font-family: inherit;">To become a proficient <b>data 
scientist</b>, four major skills are required. These are <b>problem-solving skills, analytical skills, coding skills and mathematical aptitude</b>. There are other skills as well that will enhance your ability and add to your charm, but being good at these four skills should definitely be your first target. Let's go through these skills one by one.</span><br /><span style="font-family: inherit;"><br /></span><h3><span style="font-family: inherit;">Problem Solving Skill</span></h3><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;">The most important skill that will help you become an expert data scientist, <a href="https://en.wikipedia.org/wiki/Business_analyst" target="_blank">business analyst</a>, <a href="https://cloud.google.com/certification/data-engineer" target="_blank">data engineer</a> or machine learning consultant is the skill of solving problems. Problems are an unavoidable part of every domain. Finance, insurance, health care, logistics, supply chain and other industries each have a particular way of functioning, and understanding the irregularities in these functions makes you a good problem solver. As a data scientist, you can work in any domain; you will gain the domain knowledge gradually with experience, but the skill of solving problems is the one you need to cultivate independently of domain-specific issues. </span><br /><span style="font-family: inherit;"><br /></span> <span style="font-family: inherit;">Even in interviews, the interviewer tries to get a rough idea of your problem-solving skill by asking about hypothetical or unexpected cases. Some interviewers might ask you to guess the number of ATM transactions in New York, some might ask how you would move Mount Fuji, and some might ask you to estimate the revenue Walmart earns in a day across the US market. These problems are typically odd, and the answers to them are never exact. 
But they do help the interviewer analyze the candidate's thinking process. Good <a href="https://www.thebalance.com/problem-solving-skills-with-examples-2063764" rel="" target="_blank">problem-solving skill</a> should be the first skill you master if you are looking for <a href="https://www.forbes.com/sites/louiscolumbus/2016/10/22/15-data-scientist-jobs-that-pay-100k-or-more/" target="_blank">data scientist job</a> roles.</span><br /><span style="font-family: inherit;"><br /></span><h3><span style="font-family: inherit;">Mathematical Aptitude</span></h3><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;">The majority of the work a data scientist does involves rigorous use of mathematics, especially <a href="https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about" target="_blank">statistics</a>, algebra, calculus, and probability. As a data scientist, you will need to understand the business problem, its requirements, and which <a href="https://dzone.com/articles/introduction-6-machine" target="_blank">machine learning model</a> is the best fit for the use case. To understand these models and algorithms, you need to be familiar with the maths behind them: the derivations, assumptions, and equations. If you don't like maths, then it's going to be really challenging. </span><br /><span style="font-family: inherit;"><br /></span><h3><b><span style="font-family: inherit;">Analytical Skills</span></b></h3><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;">Analytical skills are themselves a collection of many sub-skills. Here, we are mainly concerned with pattern recognition and identification. You should be able to scan the data and analyze it. <b>You should be able to identify different patterns and relationships between entities in the data</b> that is available for analysis. 
Let's take the example of a housing dataset that has attributes such as house size, location, price, and number of bedrooms. Just by going through the data, you should get a rough idea of which attributes affect the price the most, which attributes rise with price, and which move in the opposite direction. Later on, you can do regression analysis and use other algorithms to rigorously identify the collinearity, variance and other relationships among the different variables. Analysis of the data helps a data scientist identify deviations in the data and produce insights that can be leveraged for business growth and development.</span><br /><span style="font-family: inherit;"><br /></span><h3><span style="font-family: inherit;">Coding Skills</span></h3><div><span style="font-family: inherit;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-S1vc1QVcDy4/WbDu05JjReI/AAAAAAAAAX8/9NeT__da-ZsrJKWzF-doPvsg5uhGWYLIQCLcBGAs/s1600/R%2BPython.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Languages used by data scientist" border="0" data-original-height="400" data-original-width="698" height="227" src="https://2.bp.blogspot.com/-S1vc1QVcDy4/WbDu05JjReI/AAAAAAAAAX8/9NeT__da-ZsrJKWzF-doPvsg5uhGWYLIQCLcBGAs/s400/R%2BPython.jpg" title="Languages used by data scientist" width="400" /></a></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: inherit;">Like other IT professionals, data scientists use the power of machines to handle big data, complex computations, and visualizations. Hence, to interact with machines you need to know the language of computers. </span><b style="font-family: inherit;">And for a data scientist, the most popular languages are R and Python</b><span style="font-family: inherit;">. 
Using either of these languages, you can run any of the </span><a href="http://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html" style="font-family: inherit;" target="_blank">machine learning algorithms</a><span style="font-family: inherit;"> on a limited amount of data. If the size of the data grows beyond a certain point, then big data technologies come into the picture. These include </span><a href="https://www.scala-lang.org/" style="font-family: inherit;" target="_blank">Scala</a><span style="font-family: inherit;">, </span><a href="https://spark.apache.org/" style="font-family: inherit;" target="_blank">Spark</a><span style="font-family: inherit;">, </span><a href="https://hive.apache.org/" style="font-family: inherit;" target="_blank">Hive</a><span style="font-family: inherit;">, HBase etc., which are commonly used with the Hadoop distributed file system framework. All these languages and tools come in handy when you are working with data. Hence, you need some coding skills to get comfortable with them. The better a coder you are, the more easily you will be able to grasp them.</span></div><span style="font-family: inherit;"><br /></span> These four skills are the foundational data scientist skills that will help you become a good data scientist. If you have all four, congrats! You can move ahead and take the next steps. But if you lack a few, don't worry. Just start working on them and in no time <b>you will have a strong foundation of expert data scientist skills</b>. Best of luck!<br /><br />Nitesh Choudharyhttps://plus.google.com/111131279020701441727noreply@blogger.com0