Data Science Weekly Newsletter

Issue

May 8, 2014

Editor's Picks

Why The R Programming Language Is Good For Business
Thanks to one company, the same code that is revolutionizing the scientific community is now moving up the ranks of the business world...

Wine Classification using Neural Networks
Neural networks can solve some really interesting problems once they are trained. They are particularly well suited for complex decision boundary problems over many variables. In this demo we will try to build a neural network that can classify wines from three wineries by thirteen attributes...

Spark is on fire
Spark is on the rise, to an even greater degree than I thought last month...

"Random" Predictive Content Discovery: Jarno Koponen Interview
We recently caught up with Jarno Koponen, co-founder of Random. We were keen to learn more about his background, his perspective on predictive content discovery and what he is working on now at Random...

How to create a Data-Driven Organization: One Year On
A year ago, I wrote a well-received post here entitled "How do you create a data-driven organization?". I had just joined Warby Parker and set out my various thoughts on the subject at the time, covering topics such as understanding the business and customer, skills and training, infrastructure, dashboards and metrics. One year on, I decided to write an update. So, how did we do?...

Spatial Localization of Recent Ancestors for Admixed Individuals
Ancestry analysis from genetic data plays a critical role in studies of human disease and evolution. Recent work has introduced explicit models for the geographic distribution of genetic variation and has shown that such explicit models yield superior accuracy in ancestry inference over non-model-based methods. Here we extend such work to introduce a method that models admixture between ancestors from multiple sources...

Smart Umbrellas 'could collect Rain Data'
How would you fancy being a mobile weather station? Rolf Hut, from Delft University of Technology in The Netherlands, plans to turn our umbrellas into rain gauges. His prototype smart brolly has a sensor that detects raindrops falling on its canvas, and uses Bluetooth to send this information via a phone to a computer...

Eurovision 2014: First predictions
For the last two years, I’ve been publishing the results of a statistical model for predicting the results of the Eurovision Song Contest. This year’s final takes place on Saturday in an abandoned shipyard in Copenhagen, so it’s time for some more predictions. I’ve made some small changes to the model this year, which have had huge consequences for the results, which I think should be a lot more accurate now...

Kaggle LSHTC4 Winning Solution
Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. This document describes the models and software used to build our solution...

Intuition for Simulated Annealing
This post develops the intuition behind simulated annealing via lots of pictures. It's self-contained and ought to be accessible to those without a math-centric background. It also serves as a gentle introduction to more technical discussions...

Ford Data Scientist knows how to make Business and IT talk
Michael Cavaretta wants the data, the whole data, and nothing but the data. Here's what he does in one of today's hottest IT jobs...

Zipfian Academy - All 12 weeks
A week to week summary of my experience at Zipfian Academy...

Twitch: Data Scientist - San Francisco, CA
Twitch is building the biggest live video broadcasting platform and community for gamers. Twitch is 4th in peak internet traffic in the U.S., right above Hulu and below Apple. Join the team as the 3rd data scientist, and you'll get to leverage the 2.5 TB of data coming in every day...

Yann LeCun will be doing an AMA in /r/MachineLearning on May 15 4PM EST
I'm happy to announce Director of AI Research at Facebook/NYU Professor Yann LeCun will be stopping by /r/MachineLearning on May 15 4:00-6:00 PM EST for an AMA. Based on the success of the last AMA, a thread will be created before the official AMA time for those who won't be able to attend...

Billion Words: Because today's language modeling standard should be higher
We [Google Research] are releasing scripts that convert a set of public data into a language model consisting of over a billion words, with standardized training and test splits, described in an arXiv paper. Along with the scripts, we’re releasing the processed data in one convenient location, along with the training and test data...

JHU Data Science: More is More
Today Jeff Leek, Brian Caffo, and I are launching 3 new courses on Coursera as part of the Johns Hopkins Data Science Specialization...

15 In-Depth Data Scientist Interviews
Over the past few months we have been lucky enough to conduct in-depth interviews with 15 different Data Scientists for our blog. The 15 interviewees have varied roles and focus areas: from start-up founders to academics to those working at more established companies; working across healthcare, energy, retail, agriculture, travel, dating, SaaS and more...

Data Just Right: Introduction to Large-Scale Data & Analytics
Released in Dec 2013, this book is well rated (4.7 out of 5 stars on Amazon)...
"If you work with expensive enterprise strength data management/analysis products like SAS and Oracle and you want a book that will give you a map to cover the open source tools for dealing with "big data" (i.e., Hadoop, Hive, and Pig) get this. It does an amazingly good job of explaining the utility of the various tools that are used to manage *HUGE* data."...