Data Science Weekly Newsletter

Issue

August 13, 2015

Editor's Picks

Sorting Algorithm Animations
Amazing visualization / explanation - sorting algorithms in one gif: These pages show 8 different sorting algorithms on 4 different initial conditions...

Baidu’s ‘Medical Robot’: Chinese Search Engine Reveals Its AI for Health
Frustration with China’s overburdened health care system informed Baidu’s latest product: A voice translation app akin to WebMD. Users rattle off a list of symptoms, such as achy joints, red eyes and a cough, and the Chinese search giant sends an immediate diagnostic suggestion (flu, 75 percent odds). Then it links users to a nearby medical specialist...

Navigating Themes in Restaurant Reviews with Word Mover’s Distance
What does the sentence “The Sicilian gelato was extremely rich” have in common with “The Italian ice-cream was very velvety”? To a human, the two sentences (incidentally taken from two different reviews of the same restaurant) have a similar key theme: This restaurant has a gelato dish diners are raving about. But consider this from the perspective of a machine: Apart from the words “the” and “was” (which are ubiquitous across reviews and considered stop-words), there are no words in common. How can we teach a machine how to learn that these two sentences have similar themes? ...
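
To make the machine's view concrete, here is a minimal sketch (not the article's code) that measures plain word overlap between the two sentences quoted above. The stop-word list holds only the two words the blurb names, and the helper names `content_words` and `jaccard` are hypothetical.

```python
import re

# Assumption: only the two stop-words named in the blurb are filtered out.
STOP_WORDS = {"the", "was"}


def content_words(sentence):
    """Lowercase, split into words (keeping hyphenated ones), drop stop-words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def jaccard(x, y):
    """Share of distinct words that appear in both sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


a = content_words("The Sicilian gelato was extremely rich")
b = content_words("The Italian ice-cream was very velvety")

shared = a & b
overlap = jaccard(a, b)
print(shared, overlap)
```

The two sets share no content words, so any similarity measure based on exact word matches scores the pair as unrelated. That is the gap a distance defined over word embeddings, such as Word Mover's Distance, is meant to close.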

Building the Next New York Times Recommendation Engine
The New York Times publishes over 300 articles, blog posts and interactive stories a day. Refining the path our readers take through this content — personalizing the placement of articles on our apps and website — can help readers find information relevant to them, such as the right news at the right times, personalized supplements to major events and stories in their preferred multimedia format. In this post, I’ll discuss our recent work revamping The New York Times’s article recommendation algorithm...

Survival analysis in R – step by step guide
I was recently looking for methods to apply to time-to-event data and started exploring survival analysis models. In this post, I explore the basic Kaplan-Meier (KM) estimator, a nonparametric estimator of the survival function, using a real dataset (time to death for 80 males diagnosed with different types of tongue cancer, from the package KMsurv) and a simulated dataset (using the package survsim)...
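
For readers who have not met the KM estimator, here is a from-scratch sketch in Python (the linked guide itself uses R). At each time with at least one observed event, the running survival probability is multiplied by (1 - deaths / number at risk). The function name `kaplan_meier` and the five-subject toy data are made up for illustration and are not drawn from KMsurv or survsim.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: follow-up time per subject
    observed:  True if the event was seen, False if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    leaving = Counter(durations)  # events plus censored subjects at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Subjects censored at time t still count as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data: five subjects, two of them censored.
toy_durations = [1, 3, 3, 5, 8]
toy_observed = [True, True, False, True, False]
toy_curve = kaplan_meier(toy_durations, toy_observed)
print(toy_curve)
```

On this toy data the curve steps down to roughly 0.8, 0.6 and 0.3 at times 1, 3 and 5. The subject censored at time 8 adds no step because no event was observed there.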

Machine Learning Used To Predict Fine Wine Price Moves
Curiosity about the limits of machine learning led former trader, UCL academic and startup founder, Dr Tristan Fletcher, to apply complex AI techniques to the — on the surface — rather chaotic arena of fine wine pricing, comparing them with trading techniques used for more typical asset classes...

Frequentism and Bayesianism V: Model Selection
Here I am going to dive into an important topic that I've not yet covered: model selection. We will take a look at this from both a frequentist and Bayesian standpoint, and along the way gain some more insight into the fundamental philosophical divide between frequentist and Bayesian methods, and the practical consequences of this divide...

Learning Seattle’s Work Habits from Bicycle Counts (with R!)
This is an R version of Learning Seattle’s Work Habits from Bicycle Counts [featured in last week's newsletter!]. It more or less mimics the original Python code to offer an equivalent output. If all goes well, you might just run this .Rmd file and a nice HTML output will be generated...

How to Create NBA Shot Charts in Python
In this post I go over how to extract a player's shot chart data and then plot it using matplotlib and seaborn...

Teaching Machines to Understand Us
A reincarnation of one of the oldest ideas in artificial intelligence could finally make it possible to truly converse with our computers. And Facebook has a chance to make it happen first...

Baidu explains how it’s mastering Mandarin with deep learning
Baidu senior research engineer Awni Hannun presented on a new model that the Chinese search giant has developed for handling voice queries in Mandarin. The model, which is accurate 94 percent of the time in tests, is based on a powerful deep learning system called Deep Speech that Baidu first unveiled in December 2014...

Google details how it cut Google Voice transcription error rates by 50%
Google today explained how its researchers have improved the speech recognition systems underlying the transcription for voicemails in Google Voice. Last month Google disclosed that the recognition error rate in Google Voice had gone down by 50 percent, and now Google is talking about how it achieved that success...

Director, Data Science & Analytics - The Weather Company - Andover, MA
At WSI, weather means business. We are the world's leading provider of weather-driven business solutions that enable enterprises to make better decisions using the most accurate, precise and resolute weather data available. We serve some of the world's biggest brands in the aviation, energy, insurance, and media markets, plus multiple federal and state government agencies. Based on growth and expansion in analytics, we are searching for a leader to build WSI’s capabilities in the data sciences, working closely with leaders across the company to build a variety of models, recommenders, and algorithms used by WSI customers to make critical weather-related business decisions...

Understanding Statistical Power and Significance Testing
Much has been said about significance testing – most of it negative. Methodologists constantly point out that researchers misinterpret p-values. Some say that it is at best a meaningless exercise and at worst an impediment to scientific discoveries. Consequently, I believe it is extremely important that students and researchers correctly interpret statistical tests. This visualization is meant as an aid for students when they are learning about statistical hypothesis testing...
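
As a companion to the visualization, here is a small sketch of how power can be computed for one simple case. It assumes a one-sided, one-sample z-test with known variance, which is a choice made for this example and not necessarily what the linked page uses. The function name `z_test_power` is hypothetical, and `effect_size` is the true mean shift in standard-deviation units.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test (assumed setup, known variance).

    effect_size: (true mean - null mean) / standard deviation
    n:           sample size
    alpha:       significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under the null
    # Under the alternative the test statistic is centered at effect_size * sqrt(n).
    return 1 - std_normal.cdf(z_crit - effect_size * sqrt(n))


for n in (10, 25, 50):
    print(n, round(z_test_power(0.5, n), 3))
```

With no true effect the function returns alpha itself, the false-positive rate. With a true effect, power rises as the sample size grows.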

Comparison of machine learning libraries used for classification
This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data... The algorithms studied are a) linear (logistic regression, linear SVM), b) random forest, c) boosting, and d) deep neural network... in various commonly used open source implementations like 1) R packages, 2) Python scikit-learn, 3) Vowpal Wabbit, 4) H2O, 5) xgboost, and 6) Spark MLlib...

A Beginner’s Guide to Restricted Boltzmann Machines
Given their relative simplicity, restricted Boltzmann machines are the first neural network we’ll tackle. In the paragraphs below, we describe in diagrams and plain language how they work...

Effective Python: 59 Specific Ways to Write Better Python
Recommended by several readers of the newsletter: "Effective Python is a time-efficient way to learn – or remind yourself – what the best practices are and why we use them. It’s a concise book of practical techniques to write maintainable, performant and robust code using practices widely accepted in the community..."

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page...
