**Latest Questions**

### What are the advantages of logistic regression over decision trees?

Both logistic regression and decision trees are used for classification tasks, such as:

1. Predicting whether a particular user will click an ad shown on a webpage.

2. Predicting whether a customer will take a loan from a bank.

3. Identifying whether a document was written by Author A or Author B.

Decision trees generate their output as rules, along with metrics such as support, confidence and lift, while logistic regression is based on calculating the odds of the outcome: the ratio of the probability of having the outcome to the probability of not having it.

Let us understand this better by looking at the outputs generated by these algorithms for case study 1 above. A decision tree outputs rules like: "If the ad is shown on the right side of the first page, the user will click the ad" (support 0.9, confidence 0.95, lift 3.345), while logistic regression generates an odds ratio for the user clicking on the ad, say 0.785.
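As a quick illustration of how odds and an odds ratio are computed (the click probabilities below are invented for illustration, not taken from the case study):

```python
# Illustrative only: the probabilities below are invented, not from the study.

def odds(p):
    """Odds of an event: P(event) / P(not event)."""
    return p / (1.0 - p)

p_click_treated = 0.44   # assumed P(click | ad on first page, right side)
p_click_control = 0.50   # assumed P(click | ad shown elsewhere)

odds_ratio = odds(p_click_treated) / odds(p_click_control)
print(round(odds_ratio, 3))
```

An odds ratio below 1 means the event is less likely under the first condition than under the second.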

1. Assumptions

The decision tree assumes the splits are axis-parallel, so the tree becomes more complex as the number of features increases, and multiple decision boundaries are possible. Logistic regression, on the other hand, assumes a single smooth decision boundary that is linear in the features.

2. How are the decision boundaries constructed?

Below are the two basic operations of decision trees and logistic regression.

A. Decision Tree

a. Selecting the best attribute/feature to divide a set at each branch, and

b. Deciding whether each branch is adequately justified. The different decision-tree programs differ in how these steps are accomplished.

B. Logistic Regression

a. Stepwise selection of the variables, with the corresponding coefficients computed.

b. A likelihood-ratio test is used to determine the statistical significance of the variables that will be part of the logistic regression equation.

3. Limitations

Complex decision trees may overfit the data and become unstable; you can prune the tree to address this. For logistic regression, you can use L1 regularization to address the problem of unreasonable coefficients on the independent variables.
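The contrast in decision boundaries can be sketched in a few lines of Python. The feature names and coefficients below are hypothetical; the point is only that a tree's boundary is a staircase of axis-parallel splits, while the logistic boundary is a single line:

```python
import math

# Hypothetical 2-D feature space: x1 = minutes on site, x2 = pages viewed.

def tree_predict(x1, x2):
    """A depth-2 decision tree: every split is axis-parallel."""
    if x1 <= 2.0:                       # split on x1 first
        return 1 if x2 > 5.0 else 0     # then split on x2
    return 1

def logistic_predict(x1, x2, w1=1.2, w2=0.4, b=-3.0):
    """Logistic regression: one boundary, linear in the features
    (the line w1*x1 + w2*x2 + b = 0). Coefficients are made up."""
    p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
    return 1 if p >= 0.5 else 0

# The tree carves the plane into axis-aligned rectangles; the logistic
# model separates it with a single straight line.
print(tree_predict(1.0, 6.0), logistic_predict(1.0, 6.0))
```

With many features, the tree's staircase boundary grows more complex, while the logistic boundary remains a single hyperplane.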

### What are some interesting case studies involving Big Data?

Obviously, there are some wonderful problems solved by data scientists. However, I love the problems I worked on and believe they are the best :-)

- Detecting sarcasm in speech
- Identifying every fraudulent medicine administration amongst hundreds of thousands of cases
- Helping physicians prescribe the most suitable medicine for a patient based on his insurance policy
- Detecting patterns of customers in sales data that the marketing folks of the company did not know until then
- Helping a supply chain company plan their fleet to improve productivity by over 30%

The point is that it is a wonderful yet accessible field where even regular practitioners can solve remarkable problems.

### R (programming language): How important is the R programming language nowadays?

Rapid growth in big data and the application of analytical algorithms have created massive opportunities for data scientists. Research was conducted on the tools/languages used among data scientists; the findings are as follows.

Figure 1: taken from LAVASTORM Analytics: Analytics 2014 Industry Trends Survey (Page 21)

A few reasons for R's popularity:

- R is open-source, free software, and there are currently around 5,796 packages in the CRAN package repository.
- All these packages are tested regularly and are comprehensively documented.
- It also has a vibrant user community.
- R is easily extendable via packages.

“What are the best resources to learn RHadoop?”

- Using R with Hadoop
- Leveraging R in Hadoop Environments
- RevolutionAnalytics/RHadoop

### Which one is a better option: 1) MS in machine learning from University College London (UCL) with full funding (tuition/living/flights) or 2) MS in applied statistics & data science from Cornell with a $50K expense?

Are you interested in working right after the Masters, or are you interested in a PhD?

If a PhD is the goal, go for UCL; you can then move to the USA for a PhD immediately. As you have a complete assistantship at UCL and will surely get into a PhD with full aid later, this is a great option if R&D or teaching is your interest. Even if you are only considering a PhD at a later point, pursue this.

If the Masters itself is the goal, on the other hand, UCL's primary curriculum is much more machine learning oriented. If you are allowed to work in other European nations, you can consider it seriously. The Cornell program is quite statistical; while elective options are there, it looks like you must go out of your way to take them. But the US work permit is a huge advantage. Based on that alone, I recommend Cornell. $50K is not a major expense compared to 21 months of OPT.

### What are some Machine Learning algorithms that you should always have a strong understanding of, and why?

While developing the curriculum of the INSOFE programs, I spent a lot of time pondering this.

We studied university curricula (from computer science, statistics and business schools). The top contenders there are linear programming, regression, clustering, neural networks and SVMs.

Then we looked at the peer groups. The well-known "top 10 algorithms" list is published roughly once every four years, I believe. The current list includes C5.0, kNN, SVM, EM, k-means, PageRank, CART, Naive Bayes and a few more. We also looked at competition sites like Kaggle and found the winning algorithms: singular value decomposition, restricted Boltzmann machines, random forests and spectral methods seem to be the leaders there.

Lastly, we asked industry practitioners. As expected, the focus was on data engineering, feature engineering, cleaning and visualization, with not much emphasis on modeling!

Personally, I would also add genetic algorithms to this list as a very important technique. I almost always use it for optimization.

### What types of machine learning are most commonly implemented in companies today?

I would say it depends a lot on the industry. Again, I am talking about general use in the industry and not the work of exclusive R&D teams. Let me talk about the industries I worked with:

Banks, retail companies: People are using regression (linear and logistic), clustering and association rules. I saw a large bank use a neural net for customer targeting, so I assume that is being done elsewhere. It beats me, but trees and rules seem to be less popular than regression!

Pharmaceutical companies (clinical research): ANOVA-type statistical data analysis.

Insurance companies: I had a pleasant surprise here seeing the maturity (maybe I just did not meet similar companies in other domains). They are serious: they use almost every advanced technique. I saw a couple of groups working on random forests, support vector machines, spectral clustering, etc.

IT & web companies: As expected, these are hardcore. Trees, graphs, text mining, SVMs, belief nets, etc. Maybe they are the ones pushing the boundaries.

### What is the difference between logistic regression and Naive Bayes?

Below is a list of five major differences between Naive Bayes and logistic regression.

1. **Purpose: what class of machine learning problem does it solve?**

Both algorithms can be used for classification. Using them, you could predict whether a bank should offer a loan to a customer, or identify whether a given mail is spam or ham.

2. **Algorithm's learning mechanism**

**Naive Bayes:** For the given features (x) and the label (y), it estimates the joint probability from the training data. Hence it is a generative model.

**Logistic regression:** It estimates the probability p(y|x) directly from the training data by minimizing error. Hence it is a discriminative model.

3. **Model assumptions**

**Naive Bayes:** The model assumes all the features are conditionally independent given the class. So, if some of the features are dependent on each other (as in a large feature space), the prediction may be poor.

**Logistic regression:** It splits the feature space linearly and works reasonably well even when some of the variables are correlated.

4. **Model limitations**

**Naive Bayes:** Works well even with little training data, as the estimates are based on the joint density function.

**Logistic regression:** With little training data, the model estimates may overfit.

5. **Approach to improving the results**

**Naive Bayes:** When the training data is small relative to the number of features, information on prior probabilities helps improve the results.

**Logistic regression:** When the training data is small relative to the number of features, Lasso and Ridge regularization help improve the results.

### What are the coolest things that have been done by statisticians, data scientists, or machine learning experts?

Obviously, there are some wonderful problems solved by data scientists. However, I love the problems I worked on and believe they are the best :-)

- Detecting sarcasm in speech
- Identifying every fraudulent medicine administration amongst hundreds of thousands of cases
- Helping physicians prescribe the most suitable medicine for a patient based on his insurance policy
- Detecting patterns of customers in sales data that the marketing folks of the company did not know until then
- Helping a supply chain company plan their fleet to improve productivity by over 30%

The point is that it is a wonderful yet accessible field where even regular practitioners can solve remarkable problems.

### What can be a possible time line to improve my data science skills in 1-2 years?

There are some excellent resources here. But I thought a more helpful approach might be a plan, and hence I am adding one more answer to this list.

My goal is to create a plan that gets you to the level of an average industry practitioner.

Skills you need: the ability to take Excel/CSV data sets, pre-process and visualize them, build a model and visualize the results.

Recommended steps:

1. Download one data set from Kaggle/UCI or anywhere on the Internet. I am deliberately not giving a link, as I want you to search through multiple sets. Create a deck of slides describing the business problem, ROI, current practices, their weaknesses, etc.

Milestone 1: Creating a business context for a problem is a crucial step in becoming a practitioner. Congrats, you have done that! You should spend a week on this, provided you put in 20 hours a week.

2. Look at the attributes given. Brainstorm whether you can create more attributes from them. If transactions are given, you can create the average number of transactions per day, the average value of transactions, etc. Think and create as many new attributes as you can.

3. Download R and Deducer (my preference). Both are open source.

4. From the resources provided by others, learn the techniques and intuition behind standard data pre-processing (ways to fill missing values, bin numeric variables, merge categorical variables, scale data, reduce dimensionality, etc.).

5. Use Excel/Deducer to create the new attributes and pre-process the data.

Milestone 2: Creating one big structured table where independent attributes are columns and records are rows is a huge step in solving the problem. You should be able to do this with 4 weeks of work. Don't forget to add a few slides on data pre-processing to your deck.

6. Learn descriptive statistics and the histogram, box plot, scatter plot and bar chart. Learn to plot these in Deducer/ggplot.

7. Do detailed descriptive statistics and visualizations on the data. There are excellent resources on this all over the net. I created a few videos myself (http://beyond.insofe.edu.in/cate…)

Milestone 3: Visualization is considered the most important interfacing step, and you are done with it. Add these to your slide deck. Allocate two weeks for this.

8. Learn linear regression, logistic regression and clustering from any of the resources given in these threads.

9. Apply them on your data sets and do all the diagnostics. Deducer makes this easy.

Milestone 4: Congrats! You built your predictive models. I think you need 3 weeks for this step.

10. Brainstorm about how you can simplify and present these results. The goal is to present to a non-data-scientist. Use your visualization skills again. Add these slides to your deck.

Milestone 5: Take a week or two for this.
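The mean-imputation and scaling mentioned in the pre-processing steps can be sketched with the standard library alone (the records and columns below are invented):

```python
from statistics import mean, pstdev

# Invented transaction records, as you might read from a CSV file.
rows = [
    {"amount": 120.0, "age": 34.0},
    {"amount": None,  "age": 45.0},   # a missing value to fill
    {"amount": 80.0,  "age": 23.0},
    {"amount": 200.0, "age": None},   # another missing value
]

def fill_missing(rows, col):
    """Replace None in `col` with the column mean (one common strategy)."""
    m = mean(r[col] for r in rows if r[col] is not None)
    for r in rows:
        if r[col] is None:
            r[col] = m
    return rows

def zscore(rows, col):
    """Scale `col` to zero mean and unit variance."""
    vals = [r[col] for r in rows]
    m, s = mean(vals), pstdev(vals)
    for r in rows:
        r[col] = (r[col] - m) / s
    return rows

for col in ("amount", "age"):
    rows = zscore(fill_missing(rows, col), col)

print(rows[0])
```

In real projects you would usually do the same with pandas or Deducer; the sketch only shows the logic.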

You have created a slide deck, some code and a knowledge base. More importantly, you solved a problem end-to-end. Voila, in approximately 12 weeks you are where 90% of data scientists are :-)

Now, to get to a higher level:

Add more algorithms (decision trees, neural nets, etc.). Learn more domains and problems. Study techniques for handling unstructured data. There are wonderful courses in the thread. Take them slowly.

Hope this helps.

### What are currently the hot topics in Machine Learning research and in real applications?

In general, the following are fairly hot in the machine learning and data science communities that are interested in modeling:

Deep learning: This seems to be breaking accuracy benchmarks on a variety of complex problems.

NLP: Understanding sentiment, sarcasm and urgency, and summarizing free-flowing text, are being studied extensively.

Spectral methods and kernel methods for modeling are perennially hot problems.

From an engineering perspective, there is a lot of emphasis on building newer visualization tools and techniques. Of course, I see a new engineering framework for big data every week.

I am sure, many folks at Quora are interested in these areas.

### What skills are needed for machine learning jobs?

To answer this question correctly, you need to ask yourself what job you want to aim for. A data scientist can aim for three different jobs. For lack of better words (or my lack of knowledge of those words!), let me classify them as

1. Analysts, 2. Consultants, 3. Engineers

1. Analysts: These are the folks who do the same job repeatedly (statistical analysis in clinical trials, target marketing in banks, etc.). In India, I see that quite a few companies doing outsourced analytics also fall into this category. I noticed that they get data in a standard form, use the same model to analyze it and use the same charts to visualize it. The variance from project to project is very little.

You need to be a master of one or two modules of one tool (like SAS or SPSS) for this. Any online video, an installed version of the software and some data are good enough to get you started. You do not need an in-depth understanding of the science either.

Your organization itself has a lot of inertia against trying anything new. I really had a tough time convincing a bank to try decision trees (they had been doing logistic regression for 20 years) as late as 2010! The manager asked why I brought in new things when the old ones were working fine :-)

Also, when I talked to his team about logistic regression, I realized that they did not understand the underlying mathematics or science well enough. But, it was not a major deterrent for that specific job. They were doing fine.

Beware: these are the low-end jobs in data science. Choose this path if and only if you are OK with routine and not-so-difficult work.

2. Consultants: These are the McKinsey, Deloitte and Booz Allen Hamilton kind of people. I also see them in dedicated analytics groups of large insurance and tech companies. They work on the different problems their clients face and provide the needed guidance and consulting.

You need a very good aptitude for understanding and communicating business problems at a high level (MBA-ish skills, sort of). You need to be very good with a few standard algorithms (trees, nearest neighbors, regression, naive Bayes). If you position yourself as a data scientist rather than a business consultant, you also need working knowledge of more advanced algorithms (support vector machines, belief nets, neural nets, etc.). I strongly recommend hands-on experience with one language to implement these (R, SAS, SPSS…). In fact, nowadays I teach R/Shiny to my students so they can quickly put up interactive demos. I also strongly recommend a visualization tool (ggplot in R, Tableau or QlikView).

I also emphasize understanding the underlying mathematics intuitively. You should be able to play and experiment, not just use. Problem-solving and logical skills are very important.

3. Engineers: These are the product guys. Google/Amazon/FB and a score of start-ups etc. need data guys who can code and build products.

You need to be very good at SQL and one language (my favorite is Python, but Java etc. is fine). Nowadays, NoSQL skills (Mongo, Cassandra, HBase, etc.) and Hive/Pig-style big data scripting skills are also very useful. You need to be very good with machine learning algorithms, efficient software engineering and standard coding and development procedures. You will most likely work on technology, and hence the business and consulting skills are not as important as in the previous roles.

I cannot avoid talking about the last one (my own profession): scientist! In all three roles above, interestingly, an intuitive understanding of the algorithms is good enough and you do not need really deep math (I know I am scandalizing a purist here!).

If your goal is to teach and do research in data science, you need the skills mentioned in either 2 (if you want to go for teaching in a business school) or 3 (if you want to teach in a CS school). In addition, you must be extremely good in advanced undergraduate mathematics (calculus, linear algebra and coordinate geometry). Designing newer algorithms and mathematics becomes very important here.

So, to sum it up, the skills you need to hone depend on the specific interests you want to pursue as a data scientist. Realize that data science is very broad and hence may lead to different professions. You pick what you love and tune yourself for that.

### How do I become a data scientist?

There are some excellent resources here. But I thought a more helpful approach might be a plan, and hence I am adding one more answer to this list.

My goal is to create a plan that gets you to the level of an average industry practitioner.

Skills you need: the ability to take Excel/CSV data sets, pre-process and visualize them, build a model and visualize the results.

Recommended steps:

1. Download one data set from Kaggle/UCI or anywhere on the Internet. I am deliberately not giving a link, as I want you to search through multiple sets. Create a deck of slides describing the business problem, ROI, current practices, their weaknesses, etc.

Milestone 1: Creating a business context for a problem is a crucial step in becoming a practitioner. Congrats, you have done that! You should spend a week on this, provided you put in 20 hours a week.

2. Look at the attributes given. Brainstorm whether you can create more attributes from them. If transactions are given, you can create the average number of transactions per day, the average value of transactions, etc. Think and create as many new attributes as you can.

3. Download R and Deducer (my preference). Both are open source.

4. From the resources provided by others, learn the techniques and intuition behind standard data pre-processing (ways to fill missing values, bin numeric variables, merge categorical variables, scale data, reduce dimensionality, etc.).

5. Use Excel/Deducer to create the new attributes and pre-process the data.

Milestone 2: Creating one big structured table where independent attributes are columns and records are rows is a huge step in solving the problem. You should be able to do this with 4 weeks of work. Don't forget to add a few slides on data pre-processing to your deck.

6. Learn descriptive statistics and the histogram, box plot, scatter plot and bar chart. Learn to plot these in Deducer/ggplot.

7. Do detailed descriptive statistics and visualizations on the data. There are excellent resources on this all over the net. I created a few videos myself (http://beyond.insofe.edu.in/cate…)

Milestone 3: Visualization is considered the most important interfacing step, and you are done with it. Add these to your slide deck. Allocate two weeks for this.

8. Learn linear regression, logistic regression and clustering from any of the resources given in these threads.

9. Apply them on your data sets and do all the diagnostics. Deducer makes this easy.

Milestone 4: Congrats! You built your predictive models. I think you need 3 weeks for this step.

10. Brainstorm about how you can simplify and present these results. The goal is to present to a non-data-scientist. Use your visualization skills again. Add these slides to your deck.

Milestone 5: Take a week or two for this.

You have created a slide deck, some code and a knowledge base. More importantly, you solved a problem end-to-end. Voila, in approximately 12 weeks you are where 90% of data scientists are :-)

Now, to get to a higher level:

Add more algorithms (decision trees, neural nets, etc.). Learn more domains and problems. Study techniques for handling unstructured data. There are wonderful courses in the thread. Take them slowly.

Hope this helps.

### What should everyone know about making good charts and graphs to represent data?

I created a series of videos explaining this. Here are the links.

http://beyond.insofe.edu.in/category/essentialskillstoolkit/data-visualization/

### Data Science: How do scientists use statistics?

Let me explain this with an example. Let us say I want to understand how much wear a component undergoes when used for a year in salt water. There are actually three ways of solving this problem:

The engineer's way: Engineers or scientists study what wear is. They define it as the gradual reduction of material. Then, using molecular kinetics, the fundamental laws of thermodynamics and a bunch of assumptions, they derive an equation for the wear (something like wear = square of area times cube of salt concentration, etc.). They validate it with experimental observations and conclude. Then they design some thumb rules and charts and use these in their practice. This elegant process (use some fundamental laws, mathematics and experiments to solve a problem) is called deductive learning. It used to be the most popular way of doing science from Newton to the 1950s.

The computer scientist's way: Computer science added a new way of solving problems called simulation. In this, I design the properties of a salt water molecule and also the properties of a molecule on the surface of the vessel. Individual components are easier to design, but interactions are difficult to design. Now I create a million surface molecules and a billion molecules of salt water, let them randomly interact, and then study what happens and document the knowledge. You can see that the approach here is fundamentally different from the previous way (model the simplest parts and let them randomly interact). Simulations are extremely powerful where non-linear problems need to be solved and deductive learning simply cannot track the complexity involved.

The statistician's way: This is becoming popular with the ease with which data is collected nowadays. The statistician looks at 1,000 vessels and their wear after a year in salt water. They also measure the wear of vessels in regular water and in no water. They then compute interesting statistics and conclude that if a vessel is in salt water for a year, it wears 5% more than in regular water and 50% more than vessels not in water. (These numbers are just examples and are not correct; I am using them to communicate the point.) So, the statistical way is to study the behavior of large samples in carefully designed experiments and report the measurements with a variety of statistical metrics.

The data scientist's way is slightly different but need not be studied in this context. There are many sociological and systemic problems where fundamental deductive learning is impractical. The fact that things are networked also leads to complicated interactions between parts that make deductive learning infeasible. In such situations, scientists use statistical analysis or simulation to gain the knowledge. Many times, knowledge gained through one method is validated through the other.
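The statistician's comparison can be sketched as follows. The wear numbers are simulated from invented distributions, chosen only to mirror the illustrative 5% / 50% figures:

```python
import random
from statistics import mean

random.seed(0)

# Simulated wear measurements (mm) for 1,000 vessels per condition.
# The underlying means are invented to mirror the 5% / 50% example figures.
salt    = [random.gauss(1.05, 0.05) for _ in range(1000)]
regular = [random.gauss(1.00, 0.05) for _ in range(1000)]
dry     = [random.gauss(0.70, 0.05) for _ in range(1000)]

def pct_more(a, b):
    """How much more wear group a shows than group b, in percent."""
    return 100.0 * (mean(a) - mean(b)) / mean(b)

print(f"salt vs regular water: {pct_more(salt, regular):.1f}% more wear")
print(f"salt vs no water:      {pct_more(salt, dry):.1f}% more wear")
```

A real analysis would also report uncertainty (confidence intervals or a hypothesis test), not just the point estimates.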

### What are some software and skills that every Data Scientist should know?

I have seen some wonderful answers already in the list below. But this question is important enough, and rich enough, to generate different perspectives. Here, let me present mine. I give this answer to my students on their first day.

To answer this question correctly, you need to ask yourself what job you want to aim for. A data scientist can aim for three different jobs. For lack of better words (or my lack of knowledge of those words!), let me classify them as

1. Analysts, 2. Consultants, 3. Engineers

1. Analysts: These are the folks who do the same job repeatedly (statistical analysis in clinical trials, target marketing in banks, etc.). In India, I see that quite a few companies doing outsourced analytics also fall into this category. I noticed that they get data in a standard form, use the same model to analyze it and use the same charts to visualize it. The variance from project to project is very little.

You need to be a master of one or two modules of one tool (like SAS or SPSS) for this. Any online video, an installed version of the software and some data are good enough to get you started. You do not need an in-depth understanding of the science either.

Your organization itself has a lot of inertia against trying anything new. I really had a tough time convincing a bank to try decision trees (they had been doing logistic regression for 20 years) as late as 2010! The manager asked why I brought in new things when the old ones were working fine :-)

Also, when I talked to his team about logistic regression, I realized that they did not understand the underlying mathematics or science well enough. But, it was not a major deterrent for that specific job. They were doing fine.

Beware: these are the low-end jobs in data science. Choose this path if and only if you are OK with routine and not-so-difficult work.

2. Consultants: These are the McKinsey, Deloitte and Booz Allen Hamilton kind of people. I also see them in dedicated analytics groups of large insurance and tech companies. They work on the different problems their clients face and provide the needed guidance and consulting.

You need a very good aptitude for understanding and communicating business problems at a high level (MBA-ish skills, sort of). You need to be very good with a few standard algorithms (trees, nearest neighbors, regression, naive Bayes). If you position yourself as a data scientist rather than a business consultant, you also need working knowledge of more advanced algorithms (support vector machines, belief nets, neural nets, etc.). I strongly recommend hands-on experience with one language to implement these (R, SAS, SPSS…). In fact, nowadays I teach R/Shiny to my students so they can quickly put up interactive demos. I also strongly recommend a visualization tool (ggplot in R, Tableau or QlikView).

I also emphasize understanding the underlying mathematics intuitively. You should be able to play and experiment, not just use. Problem-solving and logical skills are very important.

3. Engineers: These are the product guys. Google/Amazon/FB and a score of start-ups etc. need data guys who can code and build products.

You need to be very good at SQL and one language (my favorite is Python, but Java etc. is fine). Nowadays, NoSQL skills (Mongo, Cassandra, HBase, etc.) and Hive/Pig-style big data scripting skills are also very useful. You need to be very good with machine learning algorithms, efficient software engineering and standard coding and development procedures. You will most likely work on technology, and hence the business and consulting skills are not as important as in the previous roles.

I cannot avoid talking about the last one (my own profession): scientist! In all three roles above, interestingly, an intuitive understanding of the algorithms is good enough and you do not need really deep math (I know I am scandalizing a purist here!).

If your goal is to teach and do research in data science, you need the skills mentioned in either 2 (if you want to go for teaching in a business school) or 3 (if you want to teach in a CS school). In addition, you must be extremely good in advanced undergraduate mathematics (calculus, linear algebra and coordinate geometry). Designing newer algorithms and mathematics becomes very important here.

So, to sum it up, the skills you need to hone depend on the specific interests you want to pursue as a data scientist. Realize that data science is very broad and hence may lead to different professions. You pick what you love and tune yourself for that.

### What are the new approaches for data modeling?

A very interesting question, which can be answered from multiple perspectives.

Techniques: If you are looking at techniques in data modelling, there are quite a few that are exploding. Deep learning, spectral methods, kernel methods, probabilistic graphical models and social network analytics are all among the latest and fastest growing areas.

Business verticals: We are also seeing a lot of interest in data science applications across the entire circle of healthcare industries: pharmaceutical companies, hospitals and insurance companies. Previously, only banks and retail organisations used to be analytics savvy. So, if I interpret your question as asking where data science is becoming a new approach to problem solving, I advise you to watch the healthcare sector.

Horizontal problems: We often hear from clients across a variety of verticals about their need to analyze unstructured data in the context of social media content. Data visualization is also a capability that is generating a lot of interest in the corporate world.

### What's the easiest way to learn machine learning?

There are some excellent resources here. But I thought a more helpful approach might be a plan, and hence I am adding one more answer to this list.

My goal is to create a plan that gets you to the level of an average industry practitioner.

Skills you need: the ability to take Excel/CSV data sets, pre-process and visualize them, build a model and visualize the results.

Recommended steps:

1. Download one data set from Kaggle/UCI or anywhere from the Internet. I am deliberately not giving a link as I want you to search through multiple sets. Create a deck of slides describing the business problem, ROI, current practices, their weakness etc.

Mile stone 1: Creating a business context for a problem is a crucial step in becoming a practitioner. Congrats, you have done that! You should spend a week for this provided you put in 20 hours a week.

2. Look at the attributes given. Brain storm whether you can create more attributes from them. If transactions are given, you can create average number of transaction per day, average value of transactions etc. Think and create as many new attributes as you can.

2. Download R, Deducer (my preference). They both are open source.

3. From the resources provided by others, learn the techniques and intuition behind standard data pre-processing (I mean ways in which you fill missing values, bin neumeric variables and merge categorical variables, scale data, dimensionality reduction etc.).

4. Use Excel/Deducer and create new data and pre-process the data.

Mile stone 2: Creating one big structured table where independent attributes are columns and records are rows is a huge step in solving. You should be able to do this with 4 weeks of work. Don’t forget to add a few slides in your ppt on data pre-processing

5. Learn descriptive statistics, histogram, box plot, scatter plot and bar chart. Learn to plot these in deducer/ggplot.

6. Do detailed descriptive statistics and visualizations on the data. There are excellent resources on this all over the net. I created a few videos myselg (http://beyond.insofe.edu.in/cate…)

Milestone 3: Visualization is considered the most important interfacing step, and you are done with it. Add these to your slide deck. Allocate two weeks for this.

8. Learn linear regression, logistic regression and clustering from any of the resources given in these threads.

9. Apply them on your data sets and do all the diagnostics. Deducer makes this easy.

Milestone 4: Congrats! You have built your predictive models. I think you need 3 weeks for this step.

10. Brainstorm about how you can simplify and present these results. The goal is to present to a non-data scientist. Use your visualization skills again. Add these slides to your deck.

Milestone 5: Take a week or two for this.

You have created a slide deck, some code and a knowledge base. More importantly, you solved a problem end-to-end. Voila! In approximately 12 weeks you are where 90% of data scientists are 🙂

Now, to get to a higher level:

Add more algorithms (decision trees, neural nets, etc.). Learn more domains and problems. Study techniques for unstructured data. There are wonderful courses in this thread; take them slowly.

Hope this helps.

### What is like to be a data scientist at the CIA?

This sounds like a fun question and hence let me venture.

**A disclaimer.** I never worked for the CIA or any detective agency. However, I worked as a scientist in defence research, and a friend of mine sold his analytics products to the CIA. I also consulted once with the detective wing of a police division. So I shall use those experiences to answer!

The mission of the CIA, according to their site, is to “Preempt threats and further US national security objectives by collecting intelligence that matters, producing objective all-source analysis, conducting effective covert action as directed by the President, and safeguarding the secrets that help keep our Nation safe.”

So, the CIA's most pressing challenges must be:

1. Collecting and processing large volumes of data (most of it unstructured). So Big Data and hacking skills must be quite useful for a CIA data scientist.

2. Explicable pattern extraction (if-then rules, visualizations, etc.) must be the biggest need: when I see activity 1, activity 2, etc., I can predict this kind of an issue. So classical machine learning and statistics must be quite important too.

3. But the most important aspect of a data scientist's job there must be to explain your findings convincingly to more important, less tech-savvy people who do not believe your approach anyway 🙂

### How are BSP (binary space partition) trees used in machine learning algorithms?

One of the standard approaches in machine learning is called “instance-based learning (IBL)”. The k-nearest neighbors algorithm is perhaps the best known among these approaches. Special cases of BSPs called k-d trees are very often used in real-world engineering of k-NN.

Let me get to details:

IBLs require that you identify one or more nearest neighbors of a given point or record to make a prediction. For example, if I want to decide whether to classify a transaction as good or fraudulent, I search for the nearest transaction whose status is known; if that one is fraudulent, I label the original transaction as fraud, and otherwise as good.

Typically, you compute distances using the known attributes. So, for the transaction data, I might look at attributes like the demography of the customer who made the transaction, the customer's past transaction history, and transaction details (amount, IP, etc.). All of these might add up to several dozen attributes. If I have around 1,000,000 known transactions, then to predict the status of each new point I need to measure its distance to 1,000,000 points (each with several dozen dimensions) to find its nearest neighbor.
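A brute-force 1-nearest-neighbor sketch on hypothetical transaction vectors makes the cost visible: every single prediction scans the whole labelled set.

```python
def euclidean_sq(p, q):
    """Squared Euclidean distance (dropping the square root does not change the ordering)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def predict_1nn(known, labels, query):
    """Label a query with the label of its nearest known point: O(n * d) per query."""
    best = min(range(len(known)), key=lambda i: euclidean_sq(known[i], query))
    return labels[best]

# Hypothetical labelled transactions as (amount_scaled, txn_count_scaled) vectors
known = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
labels = ["good", "fraud", "good", "fraud"]

label = predict_1nn(known, labels, (0.85, 0.85))
```

With 1,000,000 known points and dozens of dimensions, that `min` over all distances is exactly the bottleneck the next paragraphs address.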

Clearly, a major disadvantage is the computation time.

So people use a variety of techniques to speed this up. Some focus on working with a subset of the data.

An interesting technique is to represent the data in the form of a specific type of binary space partitioning tree called a k-d tree. Check out Wikipedia to learn more about it.

A k-d tree enables me to divide the data in a structured way and quickly discard most of the points without computing distances to them, because they cannot be the nearest neighbor. This reduces the computation time drastically. But in very high-dimensional problems, BSPs fail.
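A minimal k-d tree sketch in Python (an illustrative toy, not production code): the search descends into the half containing the query first, and only crosses the splitting plane when the other side could still hold a closer point. That pruning test is where most distance computations are avoided.

```python
def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes; node = (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return the tree point closest to query, pruning far subtrees."""
    if node is None:
        return best
    point, axis, left, right = node
    dist_sq = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    if best is None or dist_sq(point) < dist_sq(best):
        best = point
    near, far = (left, right) if query[axis] <= point[axis] else (right, left)
    best = nearest(near, query, best)
    # Cross the splitting plane only if it is closer than the best point found so far
    if (query[axis] - point[axis]) ** 2 < dist_sq(best):
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
closest = nearest(tree, (9, 2))
```

In high dimensions the pruning test almost always fails (every subtree's splitting plane is "close"), which is why the answer notes that BSPs break down there.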

Of course, decision trees are geometric space-partitioning algorithms too; CART in particular is a binary space partitioning tool. However, in regular BSPs we use a geometric criterion (like distance) for partitioning, while decision trees minimize some measure of impurity when partitioning. As I assume you are not interested in that part, I shall not delve into those methods.

### Which course is the best to take for an MS in the USA, data science or machine learning?

Perhaps a more pertinent question to ask yourself is: which department do you want to study in? If the program is offered by multiple departments, see which one is the main contributor. Machine learning is almost always offered by the computer science school, so there is no confusion there. But data science is offered by computer science, business/engineering, or statistics/mathematics schools, or a combination.

a. If the program is offered by the computer science department, the emphasis will be on programming, algorithms, data structures, software engineering, Big Data, etc. So if you want to work for a software product company or in the R&D labs of software companies, machine learning programs or data science programs offered by CS schools are a good fit.

b. If the program is offered by a business or engineering school, the emphasis will be on real-world data, tools, analysis and consulting. They will also focus a lot on OR (operations research). If your goal is to become an operations person in a domain company (healthcare/retail etc.) or a consultant in a business consulting company (McKinsey/Accenture etc.), predictive analytics programs from business or engineering schools are more relevant.

c. Stats departments focus quite a bit on mathematics. If you are one of those people who love math/equations/field surveys/design of experiments, this is the program to go for. There are plenty of opportunities in academia and in industry for statisticians. I am sure you have already read somewhere that stats is the sexiest job of the century. But beware! I have seen men weeping like boys and women like girls after an introductory stats class on the mathematical description of the multi-dimensional normal curve!!!

### What are some open questions Quora data scientists are working on/would like to know the answer to?

In general, the following are fairly hot topics in the machine learning and data science communities that are interested in modeling:

Deep learning: this seems to be breaking all accuracy benchmarks in a variety of complex problems.

NLP: understanding sentiment, sarcasm and urgency, and summarizing free-flowing text, are being studied extensively.

Spectral methods and kernel-method-driven modeling are always hot problems.

From an engineering perspective, there is a lot of emphasis on building newer visualization tools and techniques. And of course, I see a new big-data engineering model every week.

I am sure, many folks at Quora are interested in these areas.

### Is the MapReduce era coming to an end, maybe because of Spark? Which language should I master to work on Spark?

I am OK with Java and Python (I would rate myself 5/10 in each), and I have started learning Scala, where I am again at 5 on a scale of 10.

Also: how different is Shark compared to Hive, in both syntax and performance?

The Map-Reduce paradigm is well suited for batch-processing applications. For applications that need to churn large amounts of data, especially in data analysis, the computing world is inclined toward alternative paradigms like BSP and in-memory computation. Spark is a tightly coupled set of libraries that provides all the features required to perform large-scale data analysis jobs (typical use cases are machine learning and enterprise data warehouse applications).

In conjunction with YARN, Spark provides an efficient way of processing data, which is an alternative to, though not a replacement for, the Map-Reduce paradigm. There will be changes in the Hadoop framework to accommodate the dynamics introduced by Spark.

Hadoop and MR: a loosely coupled set of projects; industry tested, with benchmark results available; typical batch-processing applications.

Spark: tightly coupled libraries; an all-in-one open-source data analysis product; in-memory computation; depends on YARN and a distributed file system; high performance and scale; a relatively new project without many benchmark studies.

Scala and Python are gaining popularity in the industry. Either one at a level of 7+ is appreciated.

More about Spark vs MR here:

http://www.informationweek.com/big-data/big-data-analytics/will-spark-google-dataflow-steal-hadoops-thunder

2) How different is Shark compared to Hive, in both syntax and performance?

Ans: Shark came in to address the performance issues in Hive and to provide a platform for interactive SQL on Hadoop, so the performance improvements are embedded in the physical execution engine. Enterprise distributions of Hadoop have customized variants of Hive that improve performance over the open-source distribution (like HAWQ and Qubole's Hive-as-a-service). However, the Shark project reused most of the Hive codebase, and it was getting difficult to scale and perform at the level of Spark. Hence support for the Shark project has come to an end, and the focus is now on Spark SQL on Spark plus a distributed file system.

More on Shark vs Hive vs Spark SQL on the Databricks blog:

http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

### How essential is it to learn Java to work with Hadoop for data analytics, for someone with no Java knowledge?

Would other languages like Python serve the purpose? Given that Java is harder to learn, if a person has to choose between Java and Python, which one should they learn?