Data Science Weekly Newsletter

Issue

May 15, 2014


Editor's Picks


The Forgotten Job of a Data Scientist: Editing
An article this week proclaimed, much to the data science community’s chagrin, that “most of a data scientist’s time is spent creating predictive models.” Forget about cleaning data, doing historical analyses that go into basic reports, etc. Apparently, the core job is predictive modeling. I fear for the company that hires any data scientist who believes that... [they risk not asking] one of the most important predictive modeling questions of all: Do I really need to build this model? Can I do something simpler?...

CUDA Spotlight: GPU-Accelerated Deep Neural Networks
This week’s Spotlight is on Dr. Dan Ciresan, a senior researcher at IDSIA in Switzerland and a pioneer in using CUDA for Deep Neural Networks (DNNs). His methods have won international competitions on topics such as classifying traffic signs and recognizing handwritten Chinese characters. The following is an excerpt from our interview...


Building a Business around Machine Learning APIs
I got a variety of reactions on Twitter following my GigaOM piece on how Data Scientists work at automating themselves. One of them I want to discuss today is about building businesses on top / around Prediction APIs such as Google's or BigML's (a.k.a. machine learning APIs)...

Large-Scale Machine Learning with Apache Spark
Spark is a new cluster computing engine that is rapidly gaining popularity — with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows....

Machine Learning A Cappella - Overfitting Thriller!
A fun take on some of the challenges of overfitting from Udacity...

Data Science Anti-Pattern: The SQLoppelganger
Data scientists, attention! The time has come to call out one of the egregious anti-patterns of data science. I call it… the SQLoppelganger. Definition: A SQLoppelganger is a database query (or other analytics code) that reproduces business logic that already exists somewhere else...

Predicting Stock Swings with PsychSignal, Quandl and BigML
People like to tweet about stocks, so much so that ticker symbols get their own special dollar sign like $AAPL or $FB. What if you could mine this data for insight into public sentiment about these stocks? Even better, what if you could use this data to predict activity in the stock market?...

Robust Regression and Outlier Detection via Gaussian Processes
In the last post, I showed that after removing the outliers, one can fit a linear regression to the remaining data, an approach called robust linear regression. However, instead of detecting the outliers and then fitting the regression model, we can do better: choose a model that is robust to outliers and flexible enough to capture the main signal while excluding the outliers...

Garbage In, Garbage Out: How Anomalies Can Wreck Your Data
Flawed census data is used every year to build scientific models, do in-depth analysis, and even make large-scale policy decisions. If the data backing up a model is wildly inaccurate, then our model is useless. That is: “garbage in, garbage out.” This incident is an example of a wider issue in data analysis: anomalous data, or data that contains errors. Let’s look at a couple more examples, and how data visualization can catch these errors....

Consistency of Random Forests
Random forests are a learning algorithm proposed by Breiman (2001) which combines several randomized decision trees and aggregates their predictions by averaging. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's (2001) original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity in high-dimensional settings...

How Popular Will Your Name Be in 25 Years?
Wondering what to name your kid? Here is how every name will rise and fall in popularity over the next 25 years...


Data Scientist, Beats Music - San Francisco, CA
We’re looking for a Sr. Software Engineer for our Personalization and Machine Learning team, with experience using machine learning algorithms and techniques for user and content modeling and content recommendation, as well as experience gathering and analyzing data from disparate sources...


The LION Community
The LIONcommunity page contains mixed materials about machine learning and optimization made available by our lab and by a growing community of active researchers and users, in particular slides related to the LIONbook, usage cases in selected application areas, tutorial movies, etc...

SQL Server Analysis Services Neural Network Data Mining Algorithm
In data mining and machine learning circles, the neural network is one of the most difficult algorithms to explain. Fortunately, SQL Server Analysis Services allows for a simple implementation of the algorithm for data analytics. Check out this tip to learn more...

Data Analytics Handbook (Part 1)
Data Analytics Handbook (Part 2)
What is the data science industry? The Data Analytics handbook was created to inform students and young professionals and answer this question. Hear from over 30 data scientists, data analysts, CEOs, and academics from Facebook, LinkedIn, Yelp, Cloudera, and many more!...


Street-Fighting Mathematics: The Art of Educated Guessing and Opportunistic Problem Solving
Not a new book, though very well reviewed...
"This book is a treasure trove of intuitive, practical, and brilliant mathematical techniques. Every person with an interest in mathematics, science, or engineering will enjoy this highly stimulating and fun book."...
