Data Science Weekly Newsletter

Issue 398

July 8, 2021

Editor's Picks

Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture
With the growing need for machine learning signals from Pinterest’s huge visual dataset, we decided to take a closer look at our infrastructure that produces and serves these signals. A few parameters we were particularly interested in were signal availability, infra complexity and cost optimization, tech integration, developer velocity, and monitoring. In this post, we will describe our journey from a Lambda architecture to the new real-time signals infrastructure inspired by Kappa architecture...

Building a Gigascale ML Feature Store with Redis, Binary Serialization, String Hashing, and Compression
When a company with millions of consumers such as DoorDash builds machine learning (ML) models, the amount of feature data can grow to billions of records with millions actively retrieved during model inference under low latency constraints. These challenges warrant a deeper look into selection and design of a feature store — the system responsible for storing and serving feature data. The decisions made here can prevent overrunning cost budgets, compromising runtime performance during model inference, and curbing model deployment velocity...

Data Lake vs. Warehouse: How to Choose the Right Solution for Your Stack
Vendor-agnostic discussion of how to choose between using a data warehouse, lake, or even lakehouse for your data platform, and how it's not so much about choosing a horse for the race as picking the right tools for the job...

A Message From This Week's Sponsor

Struggling to proactively assess AI performance in production?

Caught too late, AI performance mishaps hurt your business. Mona's highly configurable performance monitoring platform enables teams to go from reactive to proactive with automatic alerts and troubleshooting for biases, concept drifts, data integrity issues, and more. Try Mona for free today!

Data Science Articles & Videos

Interpretability in Machine Learning: An Overview
This essay provides a broad overview of the sub-field of machine learning interpretability. While not exhaustive, my goal is to review conceptual frameworks, existing research, and future directions...

A Review of Coffee Data: Grades and Flavors
After trying a variety of coffees from around the world, I have often wondered how flavor differences affect grading (Q-grades or Cupping Grades). Even though I have found a general correlation between coffee grade and taste, I have really enjoyed even lower graded coffees. I have looked at two databases with coffee grades, and there are definitely regional differences, but I still didn’t have an idea how more specific flavors played a role...

Using Reddit to explore the mental health effects of the COVID-19 era.
Preliminary research by the CDC has indicated that the COVID-19 pandemic has affected our health, both physically and mentally. Anxiety, depression, suicidal thoughts, and substance abuse disorders appear to be on the increase. I thought it might be interesting to look at subreddits specific to mental health issues and see what topics people are discussing the most and to see if there is increased engagement with these communities now that more people are distanced from physical sources of support and information...

Calibrating Deep Neural Networks using Focal Loss
Miscalibration - a mismatch between a model's confidence and its correctness - of Deep Neural Networks (DNNs) makes their predictions hard to rely on. Ideally, we want networks to be accurate, calibrated and confident. We show that, as opposed to the standard cross-entropy loss, focal loss [Lin et al., 2017] allows us to learn models that are already very well calibrated...
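For readers unfamiliar with focal loss, here is a minimal sketch of the per-example formula from Lin et al. (2017), FL(p) = -(1 - p)^gamma * log(p), where p is the probability the model assigns to the true class. The function names and the gamma value of 2.0 are illustrative assumptions, not the settings used in the linked paper.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Standard cross-entropy for one example: -log(p_true)."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_true)^gamma * log(p_true).

    gamma = 0 recovers cross-entropy; larger gamma down-weights
    examples the model already classifies with high confidence.
    The default gamma here is an assumption for illustration only.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        print(f"p={p}: CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

Because the (1 - p)^gamma factor shrinks toward zero as p approaches 1, confidently correct examples contribute far less to the loss than they do under cross-entropy.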

create-ml-app
A few months ago, I started using Makefiles for my local Python ML projects. Ever since, I haven’t manually dealt with venv or pip installs. It’s not life-changing, but I now can’t imagine starting a local ML project without a Makefile. Here’s a template...

Approximate Nearest Neighbor Search in Vespa — Part 1
This blog post is part 1 in a series of blog posts where we share how the Vespa team implemented an approximate nearest neighbor (ANN) search algorithm. In this first post, we’ll explain why we selected HNSW (Hierarchical Navigable Small World Graphs) as the baseline algorithm and how we extended it to meet the requirements for integration in Vespa...

2020’s Top AI & Machine Learning Research Papers
Despite the challenges of 2020, the AI research community produced a number of meaningful technical breakthroughs. For example, teams from Google introduced a revolutionary chatbot, Meena, and EfficientDet object detectors in image recognition. Researchers from Yale introduced a novel AdaBelief optimizer that combines many benefits of existing optimization methods. OpenAI researchers demonstrated how deep reinforcement learning techniques can achieve superhuman performance in Dota 2. To help you catch up on essential reading, we’ve summarized 10 important machine learning research papers from 2020...

Introducing the NeurIPS 2020 Main Program
We have just released the full schedule, two weeks ahead of the beginning of the conference, to let people familiarize themselves with it and plan their attendance accordingly, since it is hard to both attend the conference and maintain our busy daily schedule otherwise...

Differentially Private Learning Needs Better Features (or Much More Data)
We demonstrate that differentially private machine learning has not yet reached its "AlexNet moment" on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to features learned on public data from a similar domain...

Training

Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:

Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more...
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist - Apple Pay Analytics - NYC

You will play a key role improving the Apple Pay product experience. As a member of the analytics team you will be supporting a product function. You will partner with business owners, understand goals, craft KPIs and measure ongoing performance. You will initially engage with the product and engineering teams in ensuring that we have the appropriate instrumentation in place to deliver on these metrics. You will subsequently use advanced statistical, ML and analytical techniques to analyze product performance and identify key insights that inform product improvements and business strategy. The role requires a high degree of independence, ownership and collaboration working cross functionally across all levels of a highly matrixed organization...

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Rich
Rich is a Python library for rich text and beautiful formatting in the terminal. The Rich API makes it easy to add color and style to terminal output. Rich can also render pretty tables, progress bars, markdown, syntax highlighted source code, tracebacks, and more, out of the box...

Reproducible and upgradable Conda environments: Dependency management with conda-lock
If your application uses Conda to manage dependencies, you face a dilemma. On the one hand, you want to pin all your dependencies to specific versions, so you get reproducible builds. On the other hand, once you’ve pinned everything, upgrades become difficult: you’ll start encountering the infamous "The following specifications were found to be incompatible with each other" error. Ideally you’d be able to both have a consistent, reproducible build, and still be able to quickly change your dependencies. And you can do this, with a little understanding and a bit more work...

Notebook demonstrating zero-shot classification
Code and comments from Reddit...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI, and associated programming languages, check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
