Data Science Weekly Newsletter

Issue 102

November 5, 2015

Editor's Picks

Data Mining Reveals the Extent of China’s Ghost Cities
Overdevelopment in China has created urban regions known as ghost cities that are more or less uninhabited. Nobody knew how bad the problem was until Baidu used its Big Data Lab to find out...

The problem with the data science language wars
Like many other data tool creators, I've been annoyed by the assorted "Python vs R" click-bait articles and Hacker News posts by folks who in all likelihood might not survive an interview panel with me on it. The worst part of the superficial "R vs Python" articles is that they're adding noise where there ought to be more signal about some of the real problems facing the data science community. Let me say some very brief words about my present perspective on this...

Computer, respond to this email.
What I love about working at Google is the opportunity to harness cutting-edge machine intelligence for users’ benefit. Two recent Research Blog posts talked about how we’ve used machine learning in the form of deep neural networks to improve voice search and YouTube thumbnails. Today we can share something even wilder -- Smart Reply, a deep neural network that writes email...

Distribute Processing on Your Cluster with Anaconda
Using Python on distributed computing technologies like Hadoop and Spark makes it easier to create and deploy advanced analytics in production. But managing packages on your cluster can be a full-time job. And that's why we created the cluster features of Anaconda. Learn how to manage Python packages across an entire cluster with one line of code in our webcast on November 12th. Sign Up Today.

Why there's not one "best way" to land a data science job
Being the pragmatic and thoughtful person you are, one of the first questions you asked yourself was "What is the best way to land a data science job?" Which was all well and good until you started asking people the question and got so many different answers that somehow made the whole data science job search process seem more and more mysterious with each additional answer...

Data mining Instagram feeds can point to teenage drinking patterns
Using photos and text from Instagram, a team of researchers from the University of Rochester has shown that this data can not only expose patterns of underage drinking more cheaply and faster than conventional surveys, but also find new patterns, such as what alcohol brands or types are favored by different demographic groups. The researchers say they hope exposing these patterns could help develop effective intervention...

Recently Watched: A Data Story (from Twitch)
Recently watched came up a couple of times in the past as a “nice to have” project, a.k.a. another one for the “maybe never” pile. After all, recency is everywhere. Netflix makes finishing all of House of Cards the default experience through recency. Sony sorts my game library by last played. I’ve been happily opening recent documents since Office 95. But it wasn’t clear it’d be valuable for Twitch until I asked our data the right question. How much of our viewership is already on recently watched channels?...

Visualizing Chess with ggplot
There are nice visualizations from chess data: piece movement, piece survivability, square usage by player, etc. Sadly, the authors do not always share the code and data needed to replicate the final result. So I wrote some code to show how to do some of these great visualizations entirely in R. Just for fun...

Artificial Intelligence and the Future of Work
Artificial intelligence seems like it might work the same way, creating jobs for artificial intelligence researchers and slowly displacing all other kinds of knowledge work. And while this might be where we end up a century from now, the path to get there won’t quite look the way people think...

Stop screaming already: Effects of fan distraction in NCAA basketball
An analysis of whether fans affect free-throw shooting in NCAA basketball...

Rejecting the gender binary: a vector-space operation
Using vectorized teaching evaluations to model English without any gendered words...

Understanding the Bayesian approach to false discovery rates (using baseball statistics)
Sometimes, rather than estimating a value, we’re looking to answer a yes or no question about each hypothesis, and thus classify them into two groups...To solve this, we’re going to apply a Bayesian approach to a method usually associated with frequentist statistics, namely false discovery rate control...This approach is very useful outside of baseball, and even outside of beta/binomial problems...Knowing how to work with posterior predictions for many individuals, and come up with a set of candidates for further study, is an essential skill in data science...
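For readers who want a feel for the mechanics, here is a minimal sketch of one common Bayesian recipe for false discovery rate control. It is not taken from the article. It assumes each hypothesis already comes with a posterior error probability (PEP), the posterior chance that calling it a "yes" would be wrong. The PEP values, the `q_values` helper, and the 5% threshold are all hypothetical.

```python
def q_values(peps):
    """Return a q-value per hypothesis: the mean PEP of every
    hypothesis at least as convincing as this one."""
    # Rank hypotheses from most to least convincing (smallest PEP first).
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    q = [0.0] * len(peps)
    running_total = 0.0
    for rank, idx in enumerate(order, start=1):
        running_total += peps[idx]
        # Expected share of false discoveries if we accept the top `rank` items.
        q[idx] = running_total / rank
    return q


# Hypothetical posterior error probabilities for five hypotheses.
peps = [0.01, 0.40, 0.03, 0.20, 0.08]
qs = q_values(peps)

# Keep every hypothesis whose q-value is at most 5% (an assumed threshold).
selected = [i for i, q in enumerate(qs) if q <= 0.05]
print(selected)
```

Accepting everything below a q-value cutoff means the accepted set, taken as a whole, is expected to contain at most that share of false positives, even though individual members may have a higher PEP.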

How To Figure Out The Gaps In Your Data Science Skill Set
With your unique mixture of academic and non-academic projects, you will feel like there are gaps in your current background. You've searched around the web to see if you can find some insight into your situation, but so far no recommendations on what to do and what to learn have been personal enough for you. Although you feel like you meet the qualifications for a number of data science jobs, you worry that others are more qualified and they'll get the job instead of you...

Data Scientist V, Analytics - Memorial Sloan Kettering Cancer Center - NYC
The Strategy and Innovation team leverages the power of data and analytics to shape strategic decisions at Memorial Sloan Kettering Cancer Center, a world-renowned organization dedicated to the progressive control and cure of cancer through programs of patient care, research, and education. We are seeking a Data Scientist who will develop computational tools and lead complex analyses that provide insight into the delivery of cancer care. This is a high-visibility role with frequent exposure to executive leadership and senior clinicians...

Machine Learning Isn’t Data Science
Too often, Machine Learning is used synonymously with Data Science. Before I knew what both of these terms were, I simply thought that Data Science was just some new faddish word for Machine Learning. Over time though, I’ve come to appreciate the real differences in these terms...So, for those too afraid of asking, I’m going to pretend that you asked...

Advanced Jupyter Notebook Tricks — Part I
Jupyter is so great for interactive exploratory analysis that it's easy to overlook some of its other powerful features and use cases. I wanted to write a blog post on some of the lesser known ways of using Jupyter — but there are so many that I broke the post into two parts...In Part 1, today, I describe how to use Jupyter to create pipelines and reports. In the next post, I will describe how to use Jupyter to create interactive dashboards...

Cosines and correlation
This post will explain a connection between probability and geometry. Standard deviations for independent random variables add according to the Pythagorean theorem. Standard deviations for correlated random variables add like the law of cosines. This is because correlation is a cosine...
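The claim can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the post: the two samples are made up, and the helper names are hypothetical. It computes correlation as the cosine of the angle between the mean-centered data vectors, then confirms that the standard deviation of a sum follows the law-of-cosines form sd(X+Y)^2 = sd(X)^2 + sd(Y)^2 + 2*rho*sd(X)*sd(Y).

```python
import math


def center(values):
    """Subtract the mean so the vector describes deviations only."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def std(values):
    """Population standard deviation: length of the centered vector / sqrt(n)."""
    return norm(center(values)) / math.sqrt(len(values))


# Made-up samples, used only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0]
y = [1.0, 3.0, 2.0, 6.0, 8.0]

# Pearson correlation is the cosine between the centered vectors.
rho = cosine(center(x), center(y))

sx, sy = std(x), std(y)
total = [a + b for a, b in zip(x, y)]

lhs = std(total) ** 2                         # variance of the sum
rhs = sx ** 2 + sy ** 2 + 2 * rho * sx * sy   # law-of-cosines form
print(abs(lhs - rhs) < 1e-9)
```

When rho is 0 the cross term vanishes and the Pythagorean case for uncorrelated variables is recovered. The sign of the cross term is positive here because rho measures the angle between the two vectors being added, not the interior angle of the triangle they form.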

Now You See It: Simple Visualization Techniques for Quantitative Analysis
Teaches simple, practical means to explore and analyze quantitative data... "As someone who's done over two decades of research and development on visualization technology, I highly recommend 'Now You See It' for everybody - novice to expert. Stephen Few explains visual analysis clearly and conversationally. His examples are accessible, appropriate, and beautiful..."

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page...
