
Editor Picks

An article this week proclaimed, much to the data science community’s chagrin, that “most of a data scientist’s time is spent creating predictive models.” Forget about cleaning data, doing historical analyses that go into basic reports, etc. Apparently, the core job is predictive modeling. I fear for the company who hires any data scientist that believes that... [they risk not asking] one of the most important predictive modeling questions of all: Do I really need to build this model? Can I do something simpler?...

Train a neural network to recognize color contrast...

This week’s Spotlight is on Dr. Dan Ciresan, a senior researcher at IDSIA in Switzerland and a pioneer in using CUDA for Deep Neural Networks (DNNs). His methods have won international competitions on topics such as classifying traffic signs and recognizing handwritten Chinese characters. The following is an excerpt from our interview...
Data Science Articles & Videos
 Building a Business around Machine Learning APIs
I got a variety of reactions on Twitter following my GigaOM piece on how Data Scientists work at automating themselves. One of them I want to discuss today is about building businesses on top / around Prediction APIs such as Google's or BigML's (a.k.a. machine learning APIs)...
 Large-Scale Machine Learning with Apache Spark
Spark is a new cluster computing engine that is rapidly gaining popularity. With over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows...
 Data Science Anti-Pattern: The SQLoppelganger
Data scientists, attention! The time has come to call out one of the egregious anti-patterns of data science. I call it… the SQLoppelganger. Definition: A SQLoppelganger is a database query (or other analytics code) that reproduces business logic that already exists somewhere else...
 Predicting Stock Swings with PsychSignal, Quandl and BigML
People like to tweet about stocks, so much so that ticker symbols get their own special dollar sign like $AAPL or $FB. What if you could mine this data for insight into public sentiment about these stocks? Even better, what if you could use this data to predict activity in the stock market?...
 Robust Regression and Outlier Detection via Gaussian Processes
In the last post, I showed that after removing the outliers, one can run a linear regression on the remaining data, which is called robust linear regression. However, instead of detecting the outliers and then fitting the regression model, we can do better: choose a model that is robust to outliers and flexible enough to capture the main signal while excluding the outliers...
 Garbage In, Garbage Out: How Anomalies Can Wreck Your Data
Flawed census data is used every year to build scientific models, do in-depth analysis, and even make large-scale policy decisions. If the data backing up a model is wildly inaccurate, then our model is useless. That is: “garbage in, garbage out.” This incident is an example of a wider issue in data analysis: anomalous data, or data that contains errors. Let’s look at a couple more examples, and how data visualization can catch these errors...
 Consistency of Random Forests
Random forests are a learning algorithm proposed by Breiman (2001) which combines several randomized decision trees and aggregates their predictions by averaging. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's (2001) original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity in high-dimensional settings...
Jobs

We’re looking for a Sr. Software Engineer for our Personalization and Machine Learning team, with experience using machine learning algorithms and techniques for user and content modeling and content recommendation, as well as experience gathering and analyzing data from disparate sources...
Training & Resources
 The LION Community
The LION-community page contains mixed materials about machine learning and optimization made available by our lab and by a growing community of active researchers and users, in particular slides related to the LIONbook, usage cases in selected application areas, tutorial movies, etc...

The Google of Data search...
 SQL Server Analysis Services Neural Network Data Mining Algorithm
In data mining and machine learning circles, the neural network is one of the most difficult algorithms to explain. Fortunately, SQL Server Analysis Services allows for a simple implementation of the algorithm for data analytics. Check out this tip to learn more...

What is the data science industry? The Data Analytics Handbook was created to inform students and young professionals and answer this question. Hear from over 30 data scientists, data analysts, CEOs, and academics from Facebook, LinkedIn, Yelp, Cloudera, and many more!...
Books

Not a new book, though very well reviewed...
"This book is a treasure trove of intuitive, practical, and brilliant mathematical techniques. Every person with an interest in mathematics, science, or engineering will enjoy this highly stimulating and fun book."
For a detailed list of books covering Data Science, Machine Learning, AI, and associated programming languages, check out our resources page.
P.S. Did you enjoy the newsletter? Do you have friends/colleagues who might like it too? If so, please forward it along - we would love to have them onboard :)



