Data Science Weekly Newsletter - Issue 362

Oct 29, 2020

Editor Picks
  • To apply AI for good, think form extraction
    Folks who want to use AI/ML for good generally think of things like building predictive models, but smart methods for extracting data from forms would do more for journalism, climate science, medicine, democracy, etc. than almost any other application...Today we are announcing Deepform, a baseline ML model, training data set, and public benchmark where anyone can submit their solution...This post has four parts: 1) Why form extraction is an important problem, 2) Why it’s hard, 3) The state of the art of deep learning for form extraction, and 4) The Deepform model and dataset, our contribution to this problem...
  • Data Engineering Is Software Engineering
    Recently, a coworker asked me what the difference is between data science and data engineering...I started to explain: Data engineering is getting data, cleaning data, reshaping data, validating data, and loading it into databases. Data science is all of that, plus analyzing the data and figuring out how to display it in a way that makes sense, and sometimes also building models and doing machine learning...They seemed somewhat enlightened by this answer, but I didn’t love it, because there’s a lot more to it than that...So I wanted to write something here about what data engineers do all day, because I’ve noticed a belief in many engineering orgs that data engineers are somehow not doing “real” engineering. And there’s a lot of confusion about what data scientists do with regard to data engineering...

A Message from this week's Sponsor:


Data scientists are in demand on Vettery

Get discovered by one of the thousands of hiring managers using Vettery to grow their companies’ data science teams. Here’s how it works: create a profile, name your salary, and connect with hiring managers from startups to Fortune 500 companies.

Get started - it’s completely free for job-seekers!


Data Science Articles & Videos

  • Defining Data Intuition
    Last week, one of my peers asked me to explain what I meant by "Data Intuition", and I realized I really didn't have a good definition. That's a problem! I refer to data intuition all the time!...Data intuition is one of the three skills I interview new data scientists for (along with statistics and technical skills). In fact, I just spent the first nine months of 2020 building Mozilla's data intuition. I'm really surprised to realize I can't point to a good explanation of what I'm trying to cultivate...So - I'll make one up. I propose the following definition for Data Intuition: Data Intuition is a resilience to misleading data and analyses...In other words, it's harder to mislead someone with data if they have strong data intuition. Think of this as a defense against the dark data arts...So what does that look like in practice?...
  • A Bayesian Perspective on Q-Learning
    Recent work by Dabney et al. suggests that the brain represents reward predictions as probability distributions...This contrasts against the widely adopted approach in reinforcement learning (RL) of modelling single scalar quantities (expected values). In fact, by using distributions we are able to quantify uncertainty in the decision-making process...The purpose of this article is to clearly explain Q-Learning from the perspective of a Bayesian. As such, we use a small grid world and a simple extension of tabular Q-Learning to illustrate the fundamentals. Specifically, we show how to extend the deterministic Q-Learning algorithm to model the variance of Q-values with Bayes' rule...
  • Rethinking Attention with Performers
    Transformer models have achieved state-of-the-art results across a diverse range of domains, including natural language, conversation, images, and even music...The core block of every Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in an input sequence. This however, scales poorly with the length of the input sequence, requiring quadratic computation time to produce all similarity scores, as well as quadratic memory size to construct a matrix to store these scores...To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer lengths, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19...
  • Struggling with data imbalance? Semi-supervised & Self-supervised learning help!
    Let me introduce our latest work, which has been accepted at NeurIPS 2020: Rethinking the Value of Labels for Improving Class-Imbalanced Learning. This work studies a classic but very practical and common problem: classification under imbalanced data categories (also referred to as the long-tailed distribution of data)...To begin with, I would like to summarize the main contribution of this article in one sentence: we have verified both theoretically and empirically that, for learning problems with imbalanced data (categories), using a) semi-supervised learning...or b) self-supervised learning...can greatly improve model performance. Their simplicity and versatility also make it easy to combine them with different classic methods to further enhance the learning results...In the main text, I will first introduce the background of the data imbalance problem and some of the current research status, then introduce our ideas and methods, omitting unnecessary details...
  • What Twitter learned from the Recsys 2020 Challenge
    Twitter partnered with the RecSys conference to sponsor the 2020 challenge...The participants of the challenge were asked to predict the probability of a user engaging with any of the four interactions: Like, Reply, Retweet, and Quote tweet...In this post, we describe the dataset and the three winning entries submitted by Nvidia, Learner, and Wantely teams. We try to make general conclusions about the choices that helped the winners achieve their results, notably: a) most important features, b) extremely fast experimentation speed for feature selection and model training, c) adversarial validation for generalisation, d) use of content features, and e) use of decision trees over neural networks...We hope that these findings will be useful to the wider research community and inspire future research directions in recommender systems...
  • The unreasonable effectiveness of synthetic data. Podcast with Daeil Kim
    Daeil Kim is the co-founder and CEO of AI.Reverie, a startup that specializes in creating high quality synthetic training data for computer vision algorithms. Before that, he was a senior data scientist at the New York Times. And before that he got his PhD in computer science from Brown University, focusing on machine learning and Bayesian statistics. He's going to talk about tools that will advance machine learning progress, and he's going to talk about synthetic data...
  • Switchback Tests and Randomized Experimentation Under Network Effects at DoorDash
    On the Dispatch team at DoorDash, we use simulation, empirical observation, and experimentation to make progress towards our goals; however, given the systemic nature of many of our products, simple A/B tests are often ineffective due to network effects. To be able to experiment in the face of network effects, we use a technique known as switchback testing, where we switch back and forth between treatment and control in particular regions over time. This approach resembles A/B tests in many ways, but requires certain adjustments to the analysis...
  • Data Quality for Everyday Analysis
    In this post I will introduce the basic concepts behind Data Quality, discuss the cost of bad data, review six dimensions of Data Quality assessment and go through tools and techniques that can be used to deal with quality issues when they arise...A while ago, a friend of mine presented a compelling analysis that convinced the managers in a mid-size company to make a series of decisions based on the recommendations of the newly-established data science team. However, not long after, a million dollar loss revealed that the insights were wrong. Further investigations showed that while the analysis was sound, the data that was used was corrupt...
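The Bayesian Q-Learning article above describes extending tabular Q-Learning to track the variance of Q-values with Bayes' rule. As a rough sketch of that idea (the grid world, the conjugate-normal update, and all parameter values here are our own illustration, not the article's code), each Q-value can be kept as a Gaussian and acted on via Thompson sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                   # 1-D chain: states 0..3, actions left/right
mu = np.zeros((n_states, n_actions))         # posterior mean of each Q-value
var = np.full((n_states, n_actions), 10.0)   # posterior variance (uncertainty)
obs_var = 1.0                                # assumed observation noise
gamma = 0.9                                  # discount factor

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 for reaching/staying at state 3."""
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(2000):
    s = int(rng.integers(0, n_states))
    # Thompson sampling: act on a draw from the posterior, not on the mean
    a = int(np.argmax(rng.normal(mu[s], np.sqrt(var[s]))))
    s2, r = step(s, a)
    target = r + gamma * mu[s2].max()        # bootstrapped Q target
    # Conjugate-normal update: precision-weighted blend of prior and target
    k = var[s, a] / (var[s, a] + obs_var)
    mu[s, a] += k * (target - mu[s, a])
    var[s, a] = var[s, a] * (1 - k) + 0.01   # keep a small floor of uncertainty

print(np.argmax(mu, axis=1))                 # greedy policy w.r.t. posterior means
```

Because the variance shrinks as a state-action pair is visited, exploration fades naturally where the agent is already confident, which is the behavioral payoff of the distributional view.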
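The Performer article above describes replacing quadratic softmax attention with a linear-time kernel approximation. A minimal NumPy sketch of that idea using positive random features (the dimensions, seed, and scaling are arbitrary illustration, not the official implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, m = 6, 4, 256                      # sequence length, head dim, random features

Q = rng.normal(size=(L, d)) / d**0.25    # fold the softmax temperature into Q and K
K = rng.normal(size=(L, d)) / d**0.25
V = rng.normal(size=(L, d))
W = rng.normal(size=(m, d))              # random projections, omega ~ N(0, I)

def phi(X, W):
    # Positive random features: E[phi(q) . phi(k)] = exp(q . k)
    return np.exp(X @ W.T - (X**2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

Qf, Kf = phi(Q, W), phi(K, W)
num = Qf @ (Kf.T @ V)                    # O(L*m*d) instead of O(L^2 * d)
den = Qf @ Kf.sum(axis=0)                # per-query normalizer
approx = num / den[:, None]

scores = np.exp(Q @ K.T)                 # exact softmax attention for comparison
exact = (scores / scores.sum(-1, keepdims=True)) @ V
print(np.abs(approx - exact).max())      # approximation error shrinks as m grows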
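The DoorDash switchback piece above describes alternating treatment and control per region over time. One common way to implement such an assignment deterministically is to hash the (region, time-window) bucket; this is a hypothetical sketch, and `switchback_assignment`, the salt, and the 30-minute window are our own choices, not DoorDash's code:

```python
import hashlib

WINDOW_MIN = 30  # switch arms every 30 minutes (a design choice)

def switchback_assignment(region: str, epoch_minutes: int, salt: str = "expt-1") -> str:
    """Deterministically map a (region, time-window) bucket to an experiment arm."""
    window = epoch_minutes // WINDOW_MIN
    key = f"{salt}:{region}:{window}".encode()
    return "treatment" if int(hashlib.sha256(key).hexdigest(), 16) % 2 else "control"

# Every event in the same region and window shares one arm, so network effects
# stay inside a bucket; the arm flips pseudo-randomly across windows and regions.
print(switchback_assignment("sf", 10), switchback_assignment("sf", 25))   # same window
print(switchback_assignment("sf", 40), switchback_assignment("nyc", 10))  # other buckets
```

Analysis then compares outcome means across treated and control buckets rather than across individual deliveries, which is the adjustment the post alludes to.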



Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.

The course is broken down into three guides:
  1. Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

  2. Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

  3. Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
Click here to learn more ...

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Data Scientist - Associated Press (AP) - New York, NY

    The Associated Press is the essential global news network, delivering fast, unbiased news from every corner of the world to all media platforms and formats. Founded in 1846, AP today is the largest and most trusted source of independent news and information. On any given day more than half the world's population sees news from AP.

    The Associated Press seeks a Data Science Manager based in New York, NY. The Data Science Manager will help manage data analysis, data science and data engineering solutions supporting business intelligence, news search, content enrichment and metadata services. We are a small focused team within Metadata Technology working closely with various departments and functions across the organization to design and build solutions with data, analytics and machine learning methods...

        Want to post a job here? Email us for details >>


Training & Resources

  • Landmark Papers in Machine Learning
    This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I’ve done my best to select the papers that I think are novel or significant...
  • Introduction to Linear Algebra for Applied Machine Learning with Python
    Linear algebra is to machine learning as flour is to baking: every machine learning model is based on linear algebra, as every cake is based on flour. It is not the only ingredient, of course. Machine learning models need vector calculus, probability, and optimization, as cakes need sugar, eggs, and butter. Applied machine learning, like baking, is essentially about combining these mathematical ingredients in clever ways to create useful (tasty?) models...This document contains introductory level linear algebra notes for applied machine learning. It is meant as a reference rather than a comprehensive review...



  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
Sign up to receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe. No spam — we keep your email safe and do not share it.