Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
January 23, 2020

Editor's Picks

  • Smulemates: A Karaoke Singing Style Recommendation System
    The aspect of karaoke that I love the most is that it brings people together. In fact, there is a whole social app around this concept, called Smule. As of 2018, this popular karaoke app had over 50 million monthly active users...With millions of users on the app, how can we make it easier for users to choose who to sing with? With this project, I propose that Smule add a feature called Smulemates, a recommendation system that suggests other users with similar singing styles...
  • Near-perfect point-goal navigation from 2.5 billion frames of experience
    The AI community has a long-term goal of building intelligent machines that interact effectively with the physical world, and a key challenge is teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination — without a preprovided map. We are announcing today that Facebook AI has created a new large-scale distributed reinforcement learning (RL) algorithm called DD-PPO, which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data. Agents trained with DD-PPO (which stands for decentralized distributed proximal policy optimization) achieve nearly 100 percent success in a variety of virtual environments, such as houses and office buildings. We have also successfully tested our model with tasks in real-world physical settings using a LoCoBot and Facebook AI’s PyRobot platform...
  • Talking to Myself or How I Trained GPT2-1.5b for Rubber Ducking using My Facebook Chat Data
    OpenAI’s pretrained GPT-2 models have been all the rage in NLP model fine-tuning...My particular interest has been in applying them to my personal chat data, so I can talk to arbitrary friends and, more importantly, myself whenever I want to. Why? Well, sometimes if you don’t hear voices you have to go and program the voices into being. More seriously, I wished to be able to talk with a version of myself so I can get an outside view of how I think - for analysing my tendencies and problems, rubber ducking, self-therapy and, let’s face it, most importantly for fun...

A Message From This Week's Sponsor

Data scientists are in demand on Vettery

Vettery is an online hiring marketplace that's changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today.

Data Science Articles & Videos

  • Smarter Pricing for Airbnb Using Machine Learning
    Airbnb introduced its smart pricing tool several years ago, but the problem is that the price suggestions are too low, and hosts noticed their revenues decrease when they used the tool. Third-party solutions like Hosty and Beyond Pricing have tried to fix this with models that increase hosts’ revenue; however, these are inflexible and uninterpretable... So my goal was to build a tool for new and old hosts to maximize bookings without dropping their prices too low...
  • I, Storytelling Bot - How a little bot spins narratives through machine learning
    I have created a simple bot to generate new text based on seeding text randomly chosen or entered. The final prediction candidate was a Deep Learning model using Keras/Tensorflow libraries: An LSTM model using word-tokenisation and pre-trained Word2vec (Gensim). The model was trained on a dataset built with free short stories available from Pathfinder...
  • Make your data pipeline less chaotic with one versioned definition of data - Using a central protobuf schema at Sweden’s Television
    One of the most popular TV Shows in Sweden is about following the moose as they travel, for 21 days...It turns out filming the moose is not the only thing that Swedish Television does...While there are many things to talk about when it comes to the collection and analysis of data, this post will focus on: a) A very brief intro to how we collect data at SVT, b) The chaos that can result from the changes in the business rules, c) How we use a central schema (with Protobuf) to define the data we collect and propagate this definition in the whole pipeline (from collection to storage), d) Some of the issues we encountered with that choice...
  • Bayesian Neural Networks Need Not Concentrate
    Proponents of Bayesian neural networks often claim that trained BNNs output distributions which capture epistemic uncertainty. Epistemic uncertainty is incredibly valuable for a wide variety of applications, and we agree with the Bayesian approach in general. However, we argue that BNNs require highly informative priors to handle uncertainty. We show that if the prior does not distinguish between functions that generalize and functions that don’t, Bayesian inference cannot provide useful uncertainties. This puts into question the standard argument that “uninformative priors” are appropriate when the true prior distribution is unknown...
  • Advbox: a toolbox to generate adversarial examples that fool neural networks
    In recent years, neural networks have been extensively deployed for computer vision tasks, particularly visual classification problems, where new algorithms are reported to match or even surpass human performance. Recent studies have shown that they are all vulnerable to adversarial examples... Advbox is a toolbox to generate adversarial examples that fool neural networks in PaddlePaddle, PyTorch, Caffe2, MxNet, Keras and TensorFlow, and it can benchmark the robustness of machine learning models. Compared to previous work, our platform supports black-box attacks on Machine-Learning-as-a-Service, as well as more attack scenarios, such as Face Recognition Attack, Stealth T-shirt, and Deepfake Face Detect...
  • A Connectome of the Adult Drosophila Central Brain
    The neural circuits responsible for behavior remain largely unknown. Previous efforts have reconstructed the complete circuits of small animals, with hundreds of neurons, and selected circuits for larger animals. Here we (the FlyEM project at Janelia and collaborators at Google) summarize new methods and present the complete circuitry of a large fraction of the brain of a much more complex animal, the fruit fly Drosophila melanogaster...From the resulting data we derive a better definition of computational compartments and their connections; an exhaustive atlas of cell examples and types, many of them novel; detailed circuits for most of the central brain; and exploration of the statistics and structure of different brain compartments, and the brain as a whole...
  • Open Sourcing Manifold, a Visual Debugging Tool for Machine Learning
    In January 2019, Uber introduced Manifold, a model-agnostic visual debugging tool for machine learning that we use to identify issues in our ML models. To give other ML practitioners the benefits of this tool, today we are excited to announce that we have released Manifold as an open source project...Manifold helps engineers and scientists identify performance issues across ML data slices and models, and diagnose their root causes by surfacing feature distribution differences between subsets of data. At Uber, Manifold has been part of our ML platform, Michelangelo, and has helped various product teams at Uber analyze and debug ML model performance...
  • 5 Reasons to Read Hands-On Machine Learning by Aurélien Géron (Book Review)
    Reading a book is one way to learn a new skill, but real mastery only comes from doing the thing you’re trying to learn...Aurélien Géron worked as a Product Manager at YouTube where he led the development of machine learning for video classification. His experience as a practitioner is evident in Hands-On Machine Learning as each chapter is filled with practical advice and realistic techniques for building machine learning models in industry... So here are 5 reasons why Hands-On Machine Learning is hands-down my favorite resource for learning how to build machine learning models...


Launch your new career in data science today!

The Data Science Career Track is a 6-month, self-paced online course that pairs you with your own industry-expert mentor as you learn skills like data wrangling and data storytelling, and build a unique portfolio to stand out in the job market. Land your dream job as a data scientist within six months of graduating, or the course is free.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!


Jobs

  • Senior Data Scientist - TRANZACT - NJ or Raleigh, NC

    Tranzact is a fast-paced, entrepreneurial company offering a well-rounded suite of marketing solutions to help insurance companies stay ahead of the competition. The Senior Data Scientist will solve the toughest problems at Tranzact by using data - more specifically, they will gather data, conduct analysis, build predictive algorithms and communicate findings to drive profitable growth and performance across Tranzact. Candidates must have a strong grasp of the data structure, business needs, and statistical and predictive modeling, plus a minimum of 7 years of experience building predictive algorithms...

        Want to post a job here? Email us for details >>

Training & Resources

  • How to Build Your Own Logistic Regression Model in Python
    The name of this algorithm can be a little confusing: the Logistic Regression machine learning algorithm is for classification tasks, not regression problems. The ‘Regression’ in the name refers to fitting a linear model in the feature space. The algorithm applies a logistic function to a linear combination of features to predict the outcome of a categorical dependent variable - in other words, it estimates the probability of each level of the categorical dependent variable given the predictor variables...
  • ML impossible: Train 1 billion samples in 5 minutes on your laptop using Vaex and Scikit-Learn
    Imagine that we have a dataset containing over 1 billion samples that we need to use to train a machine learning model. Due to the sheer volume alone, exploring such a dataset is already tricky, while iterating on the cleaning, pre-processing and training steps becomes a daunting task...In this article, I will demonstrate how anyone can train a machine learning model on a billion samples in a swift and efficient manner. Your laptop is all the infrastructure you need. Just make sure it is plugged in...
  • NumPy Joins Twitter
    Hello, world!...This is the official account for NumPy, the fundamental package for scientific computing with Python...Follow us for news, info and content related to NumPy!...
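The logistic-regression item above boils down to one step: squash a linear combination of the features through the logistic (sigmoid) function to get a class probability. A minimal sketch of just the prediction step in plain Python, with made-up weights rather than fitted ones:

```python
import math

def sigmoid(z):
    # logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # linear combination of the features, then the logistic squashing
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# hypothetical learned parameters for a two-feature model
weights, bias = [0.8, -0.4], 0.1
p = predict_proba(weights, bias, [2.0, 1.0])
label = 1 if p >= 0.5 else 0  # threshold the probability to classify
```

In a real model the weights and bias come from maximizing the likelihood of the training labels (scikit-learn's LogisticRegression handles that fitting step for you); the sketch only covers how a trained model turns features into a probability.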
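The Vaex item above rests on out-of-core, incremental learning: the model sees the data one chunk at a time, so the full billion rows never have to fit in memory. Independent of Vaex's own API, the pattern can be sketched with scikit-learn's partial_fit - the chunk sizes and synthetic data below are illustrative, not taken from the article:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs every class declared up front

# stream synthetic data in chunks; a library like Vaex would instead hand out
# chunks of a memory-mapped file, so only one chunk is ever in RAM at a time
for _ in range(100):
    X = rng.normal(size=(1_000, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy linear decision rule
    model.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(200, 5))
y_test = (X_test[:, 0] + 0.5 * X_test[:, 1] > 0).astype(int)
accuracy = model.score(X_test, y_test)
```

The per-chunk cost is constant, so total training time scales linearly with the number of rows - which is what makes the billion-samples-on-a-laptop claim plausible.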


Books

  • Data Science in Production: Building Scalable Model Pipelines with Python
    This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and scikit-learn, and will focus on scaling up prototype models to production...
    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page


    P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them on board :) All the best, Hannah & Sebastian
