Data Science Weekly Newsletter - Issue 98

October 8 2015

Editor Picks
 
  • Five Principles For Applying Data Science For Social Good
    At DataKind, we’ve spent the last three years teaming data scientists with social change organizations, to bring the same algorithms that companies use to boost profits, to mission-driven organizations in order to boost their impact. It has become clear that using data science in the service of humanity requires much more than free software, free labor, and good intentions...So how can these well-intentioned efforts reach their full potential for real impact? Embracing the following five principles can drastically accelerate a world in which we truly use data to serve humanity....
  • Hacking The Random Walk Hypothesis
    This article is broken up into three sections. The first section will present background information about the random walk hypothesis and compare the statistical definition of randomness to the algorithmic definition. The second section will outline my Python implementation of the NIST test suite, including a brief explanation and source code for each test. The third and final section will subject a number of financial markets to these tests and end by concluding whether or not the markets are random and, if they are random, comment on the nature of that randomness...
  • Why Do We Make Statistics So Hard For Our Students?
    If you’re like me, you’re continually frustrated by the fact that undergraduate students struggle to understand statistics. Actually, that’s putting it mildly: a large fraction of undergraduates simply refuse to understand statistics...Given this, can we blame students for thinking statistics is complicated? No, we can’t; but we can blame ourselves for letting them think that it is. They think so because we consistently underemphasize the single most important thing about statistics...
 
 

A Message from this week's Sponsor:
Continuum Analytics

 

  • Introducing Anaconda, the modern open source analytics platform
    Featuring big data analytics, real time visualization, Python and R package management, push down processing, algorithm parallelization, and much more. Learn More about Anaconda
 
 

Data Science Articles & Videos

 
  • Unboxing The Random Forest Classifier: The Threshold Distributions
    In the Trust and Safety team at Airbnb, we use the random forest classifier in many of our risk mitigation models. Despite our successes with it, the ensemble of trees along with the random selection of features at each node makes it difficult to succinctly describe how features are being split. In this post, we propose a method to aggregate and summarize those split values by generating weighted threshold distributions....
  • Bot Or Not: An End-To-End Data Analysis In Python
    This is the written version of my talk about building a classifier with pandas, NLTK, and scikit-learn to identify Twitter bots...I’m focusing on Twitter bots primarily because they’re fun and funny, but also because Twitter happens to provide a rich and comprehensive API that allows users to access information about the platform and how it’s used. In short, it makes for a compelling demonstration of Python’s prowess for data analysis work, and also areas of relative weakness...
  • What To Do With “Small” Data?
    Many technology companies now have teams of smart data-scientists, versed in big-data infrastructure tools and machine learning algorithms, but every now and then, a data set with very few data points turns up and none of these algorithms seem to be working properly anymore. What the hell is happening? What can you do about it?...
  • What Being A Data Scientist At Chartbeat Really Means
    I’ve read many pieces over the past year trying to describe what data science actually is. There’s usually some talk about math and programming, machine learning, and A/B testing. Essentially these pieces boil down to one observation: data scientists do something with data...Ok, then, what the hell does a data scientist actually do? Now this is a question I can answer. And since I haven’t read many concise descriptions of what data scientists do day-to-day, I figured that I’d throw my hat into the ring and talk about the kind of data science we do here at Chartbeat.
  • The Wonderful World of Recommender Systems
    Recommender systems help people discover items they may like. Recommenders have become ubiquitous due to the explosive growth in digital information in recent years. In this talk, Yanir will give a brief overview of recommender systems, discuss common approaches to recommendation generation, and try to dispel some common myths and misconceptions about the field...
  • NHL Ice Hockey Season Preview 2015-2016
    I have built a new predictive model for single games and my first task for it is to simulate the 2015-2016 NHL season. Running ten thousand simulations gives me not just a decent estimate of the final standings points, but also of the likely spread of possibilities. Let's have the results first and then the methodology afterwards. Ten thousand may not seem like very many simulations, but the measured variance is similar to that obtained with millions of simulations...
  • Lessons Learned From Working At Continuum
    Last Friday was my last day working at Continuum Analytics. I enjoyed my time at the company, and wish success to it, but the time has come for me to move on...During my time at Continuum (over two years if you count a summer internship), I primarily worked on the Anaconda distribution and its open source package manager, conda. I learned a lot of lessons in that time, and I'd like to share some of them here....
  • Logistic Regression – Geometric Intuition
    Everybody who has taken a machine learning course probably knows the geometric intuition behind a support vector machine: An SVM is a large margin classifier. In other words, it maximizes the geometric distance between the decision boundary and the classes of samples...But what about logistic regression? What is the geometric intuition behind it and how does it compare to linear SVMs? Let’s find out...
  • This Car Knows Your Next Misstep Before You Make It
    An experimental new dashboard computer can not only keep track of your behavior behind the wheel, but even predict what you’re about to do next...A study by researchers at Cornell University and Stanford shows that a more advanced system could be trained to recognize the body language and behavior that precedes a particular maneuver. This could help trigger an early warning system, such as a blind spot alert, much earlier—perhaps thereby helping to prevent serious accidents, according to the academics involved...
 
 

Jobs

 
  • Data Scientist - Drizly - Boston

    We are building our data science and analytics team and looking for talented people to help us learn from our rapidly growing datasets. Our data ranges from inventory movement, to transactional/eCommerce, to behavioral, and we hope to harness this to provide better experiences for both our users and our retail partners. We are specifically interested in people with experience with prediction or recommender systems, search and ranking algorithms, and classification algorithms...
 
 

Training & Resources

 
  • Probability, Paradox, And The Reasonable Person Principle
    In this notebook [by Peter Norvig], we cover the basics of probability theory, and show how to implement the theory in Python. (You should have a little background in probability and Python.) Then we show how to solve some particularly perplexing paradoxical probability problems....
  • Understanding Empirical Bayes Estimation (using Baseball Statistics)
    This post isn’t really about baseball...This post is, rather, about a very useful statistical method for estimating a large number of proportions, called empirical Bayes estimation...Suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One has achieved 4 hits in 10 chances, the other 300 hits in 1000 chances. While the first player has a higher proportion of hits, it’s not a lot of evidence: a typical player tends to achieve a hit around 27% of the time, and this player’s 4/10 could be due to luck. The second player, on the other hand, has a lot of evidence that he’s an above-average batter...
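
As a rough illustration of the shrinkage idea in the blurb above, here is a minimal sketch that pulls each player's raw hit rate toward a shared Beta prior. The prior parameters (81 and 219, chosen only so the prior mean is 0.27) and the function name `shrunk_rate` are assumptions made for this example; they are not taken from the linked post, which estimates its prior from data.

```python
def shrunk_rate(hits, chances, prior_alpha, prior_beta):
    """Posterior mean of a hit rate under a Beta(prior_alpha, prior_beta) prior."""
    return (hits + prior_alpha) / (chances + prior_alpha + prior_beta)


# Hypothetical prior with mean 81 / (81 + 219) = 0.27, matching the
# "around 27% of the time" figure quoted in the blurb. The prior strength
# (300 pseudo-chances) is an arbitrary choice for illustration.
PRIOR_ALPHA = 81
PRIOR_BETA = 219

player_a = shrunk_rate(4, 10, PRIOR_ALPHA, PRIOR_BETA)       # 4 hits in 10 chances
player_b = shrunk_rate(300, 1000, PRIOR_ALPHA, PRIOR_BETA)   # 300 hits in 1000 chances

print(f"Player A: raw 0.400, shrunk {player_a:.3f}")
print(f"Player B: raw 0.300, shrunk {player_b:.3f}")
```

Under these assumed prior values, the player with 4 hits in 10 chances is pulled almost all the way back to the prior mean (about 0.274), while the player with 300 hits in 1000 chances stays near the raw rate (about 0.293), so the second player ends up with the higher estimate.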
 
 

Books

 

  • The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World

    New release...

    "With terms like ‘Machine Learning’ and ‘Big Data’ regularly making headlines, there is no shortage of hype-filled business books on the subject. There are also textbooks that are too technical to be accessible. For those in the middle—from executives to college students—this is the ideal book, showing how and why things really work without the heavy math..."

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
 
 
P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :) - All the best, Hannah & Sebastian
 