Data Science Weekly Newsletter - Issue 80

Issue #80

June 4, 2015

Editor Picks
 
  • The Unknown Perils of Mining Wikipedia
    If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus (also a Wikipedia recommendation API derived from it)...
  • Using Amazon Machine Learning to Predict the Weather
    Amazon recently launched their Machine Learning service, so I thought I’d take it for a spin. Machine Learning (ML) is all about predicting future data based on patterns in existing data. As an experiment I wanted to see if machine learning would be able to predict tomorrow's weather based on weather observations...
 
 

Data Science Articles & Videos

 
  • Recurrent Neural Shady
    Inspired by a recent blog post from Andrej Karpathy, I trained a character-by-character Recurrent Neural Network model on Eminem lyrics. Then I let the trained model generate its own Shady lyrics by sampling from the learned distribution (one character at a time). Adding the result to some background music made it quite amusing...
  • China unveils world's first facial recognition ATM machine
    China has unveiled the world's first facial recognition ATM, which will not allow users to withdraw cash unless their face matches their IDs. The machine was created by Tsinghua University and Hangzhou-based technology company Tzekwan...
  • Extending the NFL Season to a Million Games
    If time, money, and brain damage weren’t an issue, how would you decide if Nick Foles or Sam Bradford is better? You’d take a generic team, plug in Nick Foles, and play a million games. Obviously we can’t do this in real life, but we can in computers. We just need to be able to simulate football games. That sounds great, but how do we do it?...
  • So, You Need a Statistically Significant Sample?
    Although a commonly used phrase, there is no such thing as a "statistically significant sample" – it’s the result that can be statistically significant, not the sample. Word-mincing aside, for any study that requires sampling – e.g. surveys and A/B tests – making sure we have enough data to ensure confidence in results is absolutely critical...
  • Prediction intervals for Random Forests
    An aspect that is important but often overlooked in applied machine learning is intervals for predictions, be it confidence or prediction intervals. For classification tasks, beginning practitioners quite often conflate probability with confidence...
  • My aversion to pipes
    At the risk of coming across as even more of a curmudgeonly old fart than people already think I am, I really do dislike the current vogue in R that is the pipe family of binary operators...
  • How to Evaluate Machine Learning Models, Part 4: Hyperparameter Tuning
    In the realm of machine learning, hyperparameter tuning is a “meta” learning task. It happens to be one of my favorite subjects because it can appear like black magic, yet its secrets are not impenetrable. In this post, I'll walk through what hyperparameter tuning is, why it's hard, and what kind of smart tuning methods are being developed to do something about it...
 
 

Jobs

 
  • Senior Data Scientist - Nike - Portland, OR

    Nike does more than outfit the world's best athletes. We are a place to explore potential, obliterate boundaries, and push out the edges of what can be. Nike’s Global Consumer Knowledge Center of Excellence is responsible for building and deepening a holistic view of Nike’s consumers through data and analytics. We are looking for a senior statistician to work across Nike’s consumer facing businesses to define and implement measurement strategies, instrument and analyze consumer behavior, and inform Nike’s global strategy...
 
 

Training & Resources

 
  • Kaggle R Tutorial on Machine Learning
    Always wanted to compete in a Kaggle competition but not sure you have the right skillset? This interactive tutorial by Kaggle and DataCamp on Machine Learning offers the solution...
  • Out-of-core Learning and Model Persistence using scikit-learn
    When we are applying machine learning algorithms to real-world applications, our computer hardware often still constitutes the major bottleneck of the learning process. Of course, we all have access to supercomputers, Amazon EC2, Apache Spark, etc. However, out-of-core learning via Stochastic Gradient Descent can still be attractive if we want to update our model on the fly ("online learning"), and in this notebook, I want to provide some examples of how we can implement an "out-of-core" approach using scikit-learn...
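
For readers who want a feel for the out-of-core idea before opening the notebook, here is a minimal sketch in plain Python. It is not the notebook's code: scikit-learn estimators such as SGDClassifier expose a partial_fit method for this, and the small class below is only a hypothetical standard-library stand-in for that pattern. The toy data stream, the labelling rule, and every name in it are assumptions made for illustration.

```python
import math
import random


def stream_batches(n_batches, batch_size, seed=0):
    """Yield small batches of (features, label) pairs.

    Stands in for chunks read from disk; only one batch is in memory at a time.
    """
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            y = 1 if x[0] + x[1] > 0 else 0  # hypothetical toy labelling rule
            batch.append((x, y))
        yield batch


class OnlineLogisticRegression:
    """Logistic regression trained by stochastic gradient descent, one batch at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, batch):
        # One SGD step per example; the model never needs the full dataset.
        for x, y in batch:
            err = self.predict_proba(x) - y
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0


model = OnlineLogisticRegression(n_features=2)
for batch in stream_batches(n_batches=200, batch_size=50):
    model.partial_fit(batch)  # update on the fly, then discard the batch

# Evaluate on a fresh batch drawn with a different seed.
test_batch = next(stream_batches(n_batches=1, batch_size=500, seed=1))
accuracy = sum(model.predict(x) == y for x, y in test_batch) / len(test_batch)
print(f"held-out accuracy: {accuracy:.3f}")
```

The point of the design is that the model's state is just its weights, so each batch can be discarded after the update and new data can be folded in whenever it arrives.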
 
 

Books

 

  • Python: Learn Python in One Day and Learn It Well

    Clear theory and a project to work through at the end...

    "I am a novice to programming and decided to learn Python as I'm told it is one of the easiest language to learn. I read a few books on Python and this is definitely one of the best. The author is able to explain difficult concepts clearly, and the project at the end definitely helped my learning..."

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
 
 
P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :) - All the best, Hannah & Sebastian
 