Data Science Weekly Newsletter

Issue

July 24, 2014

Editor's Picks

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem...
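The excerpt stops before describing the technique. As a rough illustration, dropout randomly zeroes units during training so that the network cannot rely on any single one. The sketch below uses the commonly seen "inverted" scaling (survivors are divided by the keep probability), which is an assumption here and not necessarily the exact formulation in the paper. The function name `dropout` and its parameters are hypothetical.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Apply inverted dropout to a list of activations (illustrative only).

    Each unit is kept with probability keep_prob and scaled by 1/keep_prob,
    so the expected value of every unit is unchanged. At test time the
    activations pass through untouched.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        return list(activations)
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    hidden = [1.0] * 10
    print(dropout(hidden, keep_prob=0.8, rng=rng))
    print(dropout(hidden, keep_prob=0.8, rng=rng, training=False))
```

Because survivors are rescaled during training, no extra correction is needed when the full network is used at test time.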

Introducing tidyr
tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages)...

Airbnb Is Quietly Building the Smartest Travel Agent of All Time
Under the covers, Airbnb has quietly begun an ambitious effort to painstakingly mine the treasure trove of data contained in the site’s customer reviews and host descriptions to create a smarter way of traveling. It turns out Airbnb is more than a travel website: it’s a stealth big data company...

Leading from the Back: Making Data Science Work at a UX-driven Business
MailChimp's success as a start-up wasn't built on data. It was built on a user experience that placed an intuitive and friendly interface on email marketing and removed much of the busy work. So how does a company whose business is not data use its massive data set? John Foreman, author of the Excel-based data science book Data Smart and Chief Scientist for MailChimp, discusses what it means to "lead from the back" in data science, even if that sometimes means breaking out a spreadsheet instead of Hadoop...

Neglected Machine Learning ideas
This post is inspired by the “metacademy” suggestions for “leveling up your machine learning.” They make some halfway decent suggestions for beginners. The problem is, these suggestions won’t give you a view of machine learning as a field; they’ll only teach you about the subjects of interest to authors of machine learning books, which is different...

Creating the "Dropbox of your Genome": Reid Robison Interview
We recently caught up with Reid Robison, MD, MBA and CEO at Tute Genomics. We were keen to learn more about his background, his perspectives on the evolution of genomics, what he's working on now at Tute - and how machine learning is helping...

From Boom to Bust: Building a Predictive Quarterback Model
This past off-season I took it upon myself to develop a metric for evaluating quarterback prospects for the NFL draft. My goal was to create a metric that could ultimately help predict which draft-eligible quarterbacks would be most likely to succeed in the NFL by identifying which traits quarterback prospects had in common with successful NFL quarterbacks when they were coming out of college...

An exploratory statistical analysis of the 2014 World Cup Final
This notebook shows how you can use play-by-play data to analyse a football match, showing custom measures and visualizations to better understand the sport (taking the World Cup final as a case study)...

Data Mining at NASA to Teaching Data Science at GMU: Kirk Borne Interview
We recently caught up with Kirk Borne, trans-disciplinary Data Scientist and Professor of Astrophysics and Computational Science at George Mason University. We were keen to learn more about his background, his ground-breaking work in data mining and how it was applied at NASA, as well as his perspectives on teaching data science and how he is contributing to the education of future generations...

A platform for large-scale neuroscience
Talk from Jeremy Freeman on analyzing zebrafish neural activity with Apache Spark...

MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)
Here we introduce basic analytic and numerical concepts for simulation of Hamiltonian dynamics. We then show how Hamiltonian dynamics can be used as the Markov chain proposal function for an MCMC sampling algorithm (HMC)...
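To make the idea of "Hamiltonian dynamics as the proposal function" concrete, here is a minimal one-dimensional sketch. It is not taken from the linked post: the leapfrog integrator, the function names (`leapfrog`, `hmc_sample`), the step size, and the choice of a standard normal target are all assumptions chosen for illustration.

```python
import math
import random


def leapfrog(q, p, grad_u, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog scheme.

    q is the position, p the momentum, and grad_u the gradient of the
    potential energy U(q) = -log(target density), up to a constant.
    """
    p = p - 0.5 * step_size * grad_u(q)      # initial half step for momentum
    for i in range(n_steps):
        q = q + step_size * p                # full step for position
        if i < n_steps - 1:
            p = p - step_size * grad_u(q)    # full step for momentum
    p = p - 0.5 * step_size * grad_u(q)      # final half step for momentum
    return q, p


def hmc_sample(u, grad_u, q0, n_samples, step_size=0.2, n_steps=20, seed=0):
    """Draw samples using leapfrog trajectories as Metropolis proposals."""
    rng = random.Random(seed)
    q = q0
    samples = []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)             # resample momentum each iteration
        q_new, p_new = leapfrog(q, p0, grad_u, step_size, n_steps)
        h_old = u(q) + 0.5 * p0 * p0         # total energy before the move
        h_new = u(q_new) + 0.5 * p_new * p_new
        # Accept with probability min(1, exp(h_old - h_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


if __name__ == "__main__":
    # Assumed target: standard normal, so U(q) = q^2 / 2 and dU/dq = q.
    draws = hmc_sample(lambda q: 0.5 * q * q, lambda q: q, q0=0.0, n_samples=5000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(f"mean={mean:.3f} var={var:.3f}")
```

The accept/reject step corrects for the small energy error introduced by the discrete integrator, which is why the chain still targets the intended distribution.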

Doing Data Science in a Startup: The Hard Truth
I hate to break it to you, but a high-tech Internet startup is not a natural environment to do research. Most startups come into existence around a very applicable and practical idea (hopefully), which either requires no scientific research or rests on core research the founders had already done before the startup came to be. However, there are a number of advantages that can make startups a much more attractive working experience than classic academic-style research...

Aspiring Data Scientist? Here Are Some At Work Project Ideas
Do you find yourself wanting to move into Data Science but keep hearing "get some data, analyze it, and you'll be fine..."? Have you developed many of the base skills for data science, such as programming, data analysis, and/or visualization but are unsure of how to apply them? Are you looking to differentiate yourself from the ever-growing pile of aspiring "data scientists" who have taken the usual Coursera classes and done Kaggle competitions? You are not alone...

Software Developer - Data Science - MailChimp
MailChimp's Data Science Team is seeking a software developer to help us build internal tools and processes. We don’t care about pedigree or what languages or stacks you’ve worked in; we’re just looking for performance-minded developers who listen hard and change fast. In fact, if you’d rather send us code than polish up your resumé, that works for us. You’ll work with our data scientists and our product developers to turn research into internal services that can move enormous piles of data for statistical analysis...

Julia for Data Science
Code examples to support the most common Data Science use-cases...

Fuzzy Matching with Yhat
Ever had to manually comb through a database looking for duplicates? Anyone that's ever had a data entry job probably knows what I'm talking about. It's not fun! In this post I'm going to show you how you can write a simple, yet effective algorithm for finding duplicates in your data...
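The excerpt does not say which algorithm the post uses, so the following is only a stand-in: a small duplicate finder built on `difflib.SequenceMatcher` from the Python standard library. The helper names (`normalize`, `find_duplicates`), the sample records, and the 0.85 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def find_duplicates(records, threshold=0.85):
    """Return (a, b, score) for every pair whose similarity meets threshold.

    Compares all pairs, so the cost grows quadratically with the number of
    records; fine for a small table, too slow for a large database.
    """
    pairs = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    names = ["Acme Corp.", "acme  corp", "Globex Inc", "Initech"]
    for a, b, score in find_duplicates(names):
        print(f"{a!r} ~ {b!r} (similarity {score:.2f})")
```

Raising the threshold reduces false matches at the cost of missing looser duplicates, so the right value depends on how noisy the data entry was.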

The History of Statistics: The Measurement of Uncertainty before 1900
A definitive work on the early development of statistics...
"Stigler is unrivaled as a statistician who researches the history of statistics. This covers the famous mathematicians and statisticians who developed the foundation on which probability and statistics blossomed in the 20th Century. He is thorough and accurate and his writing is always clear and interesting..."
