Data Science Weekly Newsletter

Issue 408

September 16, 2021

Editor's Picks

The one thing I find most PhD & Master students need to unlearn
There is one thing I find most students need to unlearn: the work mentality acquired during years of tests and homework. Let me explain...

ICLR 2022 Call for Blog Posts
This year, the ICLR 2022 main conference will host a blog post track. We invite both academic and industrial researchers to submit their posts on a previously published paper at ICLR. We particularly welcome submissions on papers that appeared last year at ICLR...

Our Journey towards Data-Centric AI: A Retrospective
Starting in about 2016, researchers from our lab — the Hazy Research lab — circled through academia and industry giving talks about an intentionally provocative idea: machine learning (ML) models—long the darlings of researchers and practitioners—were no longer the center of AI. In fact, models were becoming commodities. Instead, we claimed that it was the training data that would drive progress towards more performant ML models and systems...

A Message From This Week's Sponsor

TransformX Conference: Driving AI from Experimentation to Reality
Join Scale AI for our two-day, virtual conference featuring 100+ speakers and 60+ sessions. We’re bringing together a community of leaders, visionaries, practitioners, and researchers across industries as we explore the shift from research to reality within AI and Machine Learning. Register now to secure your free ticket...

Data Science Articles & Videos

The mathematics of adversarial attacks in AI
It is well established that the current DL methodology produces universally unstable neural networks (NNs). The instability problem has caused an enormous research effort -- with a vast literature on so-called adversarial attacks -- yet there has been no solution to the problem. Our paper addresses why there has been no solution to the problem, as we prove the following mathematical paradox: any training procedure based on training neural networks for classification problems with a fixed architecture will yield neural networks that are either inaccurate or unstable (if accurate) -- despite the provable existence of both accurate and stable neural networks for the same classification problems...

Parallelizing Python Code
Python is great for tasks like training machine learning models...When performing these tasks, you also want to use your underlying hardware as much as possible for quick results. Parallelizing Python code enables this. However, using the standard CPython implementation means you cannot fully use the underlying hardware because of the global interpreter lock (GIL) that prevents running the bytecode from multiple threads simultaneously...This article reviews some common options for parallelizing Python code...
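
As a rough illustration of the GIL point above, here is a minimal sketch of one common option: running CPU-bound work in separate processes with the standard library's `concurrent.futures.ProcessPoolExecutor`. The function names `count_primes` and `count_primes_parallel` and the input sizes are hypothetical, chosen for this example rather than taken from the linked article.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=4):
    """Run count_primes once per limit, each call in a worker process.

    Every worker is a separate interpreter with its own GIL, so the
    calls can run on different CPU cores at the same time.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    limits = [50_000, 60_000, 70_000, 80_000]
    print(count_primes_parallel(limits))
```

Threads would not speed up this particular workload under CPython, because only one thread executes bytecode at a time. Processes avoid that limit at the cost of pickling the arguments and results.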

Using learning-to-rank to precisely locate where to deliver packages
For delivery drivers, finding the doorstep where a package should be dropped off can be surprisingly hard. House numbers can be obscured by foliage, or they might be missing entirely; some neighborhoods use haphazard numbering systems that make house numbers hard to guess; and complexes of multiple buildings sometimes share a single street address...I adapt an idea from information retrieval — learning-to-rank — to the problem of predicting the coordinates of a delivery location from past GPS data...

Building a smart Robot AI using Hugging Face 🤗 and Unity
Today we’re going to build this adorable smart robot that will perform actions based on player text input...It uses a deep language model to understand any text input and find the most appropriate action from its list...What’s interesting with that system, contrary to classical game development, is that you don’t need to hard-code every interaction. Instead, you use a language model that selects which of the robot’s possible actions is the most appropriate given the user input...

Bayesian Media Mix Modeling for Marketing Optimization
A problem faced by many companies is how to allocate marketing budgets across different media channels. For example, how should funds be allocated across TV, radio, social media, direct mail, or daily deals?...So-called Media Mix Modelling (MMM) can estimate how effective each advertising channel is in gaining new customers. Once we have estimated each channel’s effectiveness we can optimize our budget allocation to maximize customer acquisition and sales...In this blog post, we outline what you can do with MMMs, introduce how they work, summarise some of the benefits they can provide, and cover some of the modeling challenges...

Bad Labels: GridSearch is Not Enough
I write a lot of blog posts on why you need more than grid-search to properly judge a machine learning model. In this blog post I want to demonstrate yet another reason: labels often seem to be wrong...The issue here isn’t just that we might have bad labels in our training set; the issue is that they appear in the validation set too. If a machine learning model can become state of the art by squeezing another 0.5% out of a validation set, one has to wonder. Are we really making a better model? Or are we creating a model that is better able to overfit on the bad labels?...

Anomaly Detection: Why Your Data Team Is Just Not That Into It
Here’s why and how some of the best data teams are turning to DevOps and Site Reliability Engineering for inspiration when it comes to achieving a proactive, iterative model for data trust. Introducing: the Data Reliability lifecycle...

Bad Labels: Introduction
Even famous datasets have bad labels in them...Because it's such a big problem we wanted to spend a few videos on this topic. It'd be a shame if our machine learning models are merely optimal because they overfit on the bad labels. That's why we're going to explore heuristics to find bad labels in our training data so that we may try to improve the quality of our training data...

Embedding Values in Artificial Intelligence (AI) Systems
Though there are numerous high-level normative frameworks, it is still quite unclear how or whether values can be implemented in AI systems. Van de Poel and Kroes (2014) have recently provided an account of how to embed values in technology. The current article proposes to expand that view to complex AI systems and explain how values can be embedded in technological systems that are “autonomous, interactive, and adaptive”...

How To Lead In Data Science
The Data Exchange Podcast: Jike Chong and Yue Cathy Chang on helping data scientists increase their impact in business and in society...

Training

Quick Question For You: Do you want a Data Science job? After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course. The course is broken down into three guides:

Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more... *Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Senior Data Scientist - TikTok - LA

TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy by offering a home for creative expression and an experience that is genuine, joyful, and positive.
- Generate useful features from large amounts of data
- Apply supervised and unsupervised machine learning techniques, such as linear and logistic regression, decision trees, and k-means clustering
- Develop segmentation models, classification models, propensity models, LTV models, experimental design, optimization models
- Perform statistical analysis such as KPI deep dives, performance marketing efficiency, behavioral clustering, and user journey analytics
- Curate audiences and inform engagement tactics to enable differentiated, relevant marketing touches across channels (social, email, in app, push)
- Synthesize analytics and statistical approaches into easy-to-consume storylines, both visually and verbally, and provide indicated actions for executive audiences
- Capture business requirements for data and analytic solutions and collaborate XFN to ensure business requirements align with business needs
- Analyze creatives and surface insights that will help drive engagement and retention
- Support day-to-day collaboration with performance marketing to communicate insights and recommend data informed strategies

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

How percentile approximation works (and why it's more useful than averages)
As I was researching this piece, I found a number of good blog posts (see examples from the folks at Dynatrace, Elastic, AppSignal, and Optimizely) about how averages aren’t great for understanding application performance, or other similar things, and why it’s better to use percentiles...I won’t spend too long on this, but I think it’s important to provide a bit of background on why and how percentiles can help us better understand our data...First off, let’s consider how percentiles and averages are defined. To understand this, let’s start by looking at a normal distribution...
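
To make the averages-versus-percentiles point concrete, here is a small sketch with made-up latency numbers. It computes exact nearest-rank percentiles on a tiny list, which is not the approximation technique the linked post covers. The `percentile` helper and the `latencies_ms` data are hypothetical and exist only for this illustration.

```python
from statistics import mean


def percentile(values, p):
    """Exact nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # Nearest rank is ceil(p * n / 100), computed with integer arithmetic.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up data: 95 fast requests (20 ms) and 5 very slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

print("mean:", mean(latencies_ms))            # 119
print("p50: ", percentile(latencies_ms, 50))  # 20
print("p95: ", percentile(latencies_ms, 95))  # 20
print("p99: ", percentile(latencies_ms, 99))  # 2000
```

The mean of 119 ms describes no request that actually happened. The percentiles show that the typical request takes 20 ms while a small tail takes 2000 ms.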

State of PyTorch core: September 2021 edition
There are a lot of projects currently going on in PyTorch core and it can be difficult to keep track of all of them or how they relate with each other. Here is my personal understanding of all the things that are going on, organized around the people who are working on these projects, and how I think about how they relate to each other...

Data Visualization for Machine Learning Practitioners [Video]
Originally presented at R/Medicine 2021 by Julia Silge...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
