Data Science Weekly Newsletter

Issue 427

January 27, 2022

Editor's Picks

Experiment without the wait: Speeding up the iteration cycle with Offline Replay Experimentation
Online experimentation is often used to evaluate product ideas, but it is costly and time-consuming. Could we predict experiment outcomes without even running an experiment? Could it be done in hours instead of weeks? This post will describe how Pinterest uses offline replay experimentation to predict experiment results in advance. ...

DeepMind: The Podcast, Season 2 (12 episodes)
In the highly-praised, award-nominated "DeepMind: The Podcast", mathematician and broadcaster Hannah Fry goes behind the scenes of world-leading research lab DeepMind to find out how AI can benefit our lives and the society we live in...

Overdebunked! Six Statistical Critiques That Don’t Quite Work
Statistical results and data analyses are quite often wrong. Sometimes they’re wrong because of carelessness, sometimes they’re wrong even though we cared a lot because it’s just really hard to get them right, and other times they’re wrong on purpose. It shouldn’t shock anyone to hear this...below, I’ve listed six statistical critiques I commonly see on social media, and why they’re not great critiques...These aren’t technical errors - they’re not about misinterpreting a p-value or whatever, but more about common-sense critiques of published statistical results that anyone could make...

A Message From This Week's Sponsor

A Two-Day Virtual Interactive ML Community Event for AI/ML Developers and Data Scientists. Learn from 35+ AI experts from DeepMind, Spotify, Twitter, Disney, HuggingFace, Instacart, Colgate, Linkedin, Pinterest, Mobileye, HSBC, AstraZeneca, Verizon, BBC and more in sessions about building real-world AI and machine learning applications, best practices and strategies in AI infrastructure, ML in production, and exciting research that you can apply to your next ML or DL projects.

Data Science Articles & Videos

The Non-Engineer’s Guide to Bad Data
This article is written by a data engineer for a non-technical audience troubleshooting the “broken dashboard” problem and can help data teams educate their stakeholders on the process of tackling broken data pipelines...the reader will learn: a) The role the data engineering team plays in troubleshooting data quality issues and their current responsibilities, b) The impact "bad" data can have on their business, c) A simplified explanation of why data breaks and why it takes time to discover and fix data quality issues, and d) How data teams rely on data observability to reduce the likelihood of "bad" data entering their Tableau or Looker dashboards and reports...

ML and NLP Research Highlights of 2021
2021 saw many exciting advances in machine learning (ML) and natural language processing (NLP). In this post, I will cover the papers and research areas that I found most inspiring...1) Universal Models, 2) Massive Multi-task Learning, 3) Beyond the Transformer, 4) Prompting, 5) Efficient Methods, 6) Benchmarking, 7) Conditional Image Generation, 8) ML for Science, 9) Program Synthesis, 10) Bias, 11) Retrieval Augmentation, 12) Token-free Models, 13) Temporal Adaptation, 14) The Importance of Data, and 15) Meta-learning...

How do you document predictive models just in case they are audited?
[Reddit Discussion]...I work at a bank and am about to start building my first predictive model. I'm curious how you document your models in case auditors ask to see them? I'm also meeting with our internal auditors next week to come up with a plan, but I'd love to know what you do at your organization if you are willing to share...

Two reasons Kubernetes is so complex
While some of those feelings are fairly universal to learning any new system, Kubernetes really does feel a lot bigger, scarier, and more intractable than some other systems I’ve worked with. As I’ve learned it and worked with it, I’ve tried to understand why it looks the way it does, and which design decisions and tradeoffs lead to it looking the way it does. I don’t claim to have the full answer, but this post is an attempt to commit to paper two specific thoughts or paradigms I have that I reach for as I try to understand why working with Kubernetes feels so hairy sometimes...

How to navigate ML research literature
My slides on "How to navigate ML research literature" for Winter ML school...How to read papers?...How to filter out?...Where to get?...What does peer review mean?...

How to run effective ML research
Gave a talk about ML paper reproducibility at winter school on "How to run effective ML research". Discussed some challenges 🥲 during implementation, objectives, and tips ✍️. Here are the slides...

Asset2Vec: Turning 3D Objects into Vectors and Back
How we used NeRF to embed our entire 3D object catalogue to a shared latent space, and what it means for the future of graphics...

Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning
Lex Fridman Podcast #258 (Yann LeCun's second time)...Yann LeCun is the Chief AI Scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal researchers in the history of machine learning...

Beginner mistakes to avoid in building Data Pipeline
[Reddit Discussion] I've recently been promoted to a Data Engineering position at work. That being said, my first project is helping migrate data from SAP ECC to SQL Server and solidify our data pipeline so my Analytics team can extract data in a more streamlined way for our dashboards and modeling...I don't have much guidance from technical leadership or access to technical expertise in this undertaking, and I wanted to see if there were any Sr. DE's that had common "rookie" mistakes they've seen in similar initiatives that I should look out for...

Topology and Computability
Readers of this blog are familiar with notions of computability – basically, the question is, what can machines do without human assistance? And you are familiar with machines. Electronic ones of course, but I always like to think of machines as composed of gears, levers and pulleys...Topology? That’s another story. Rubber doughnuts being continuously stretched but always preserving that hole. Or calculus and differential equations...So what’s the connection? You’d be surprised...

Forum

Check out the new Anaconda Community for all-things data! Want insights into the newest developments in the world of data, or need help getting “unstuck” on a problem? Our Community Forums is the place to go! Be the first to engage with other professionals and ask questions to the broader data community. Users can join in conversations around trends, debate new features, post questions to the community, and more. Plus, it’s another avenue for technical help! Create your free Anaconda Community account now.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

(Senior) Analytics Engineer - Fabulous - Remote
Fabulous is a mobile app helping thousands of people every day to change their lifestyles by integrating healthy habits into their lives. Fabulous is using a behavioral economics lens to help everyone achieve their fullest potential. We work closely with researchers based at Duke University and our advisor is Dan Ariely, author of NYT bestseller Predictably Irrational. We are looking for an experienced Analytics Engineer to consolidate the Data Science team and lead the development and enrichment of our Data Pipelines. We have a modern Data-Stack based on Fivetran, dbt, BigQuery, Amplitude, Metabase...

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

A method for explaining machine learning models: Shapley values (SHAP)
A prediction can be explained by assuming that each feature value of the instance is a “player” in a game where the prediction is the payout. Shapley values – a method from coalitional game theory – tell us how to fairly distribute the “payout” among the features...Shapley values: a) Model-agnostic: Use with any model, b) Theoretic foundation: Game theory, c) Good software ecosystem, and d) Local and global explanations...
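
To make the "players and payout" idea concrete, here is a minimal sketch that computes exact Shapley values by enumerating every coalition of features. It is an illustration only, not the linked article's code: the toy `model`, the `instance`, and the all-zeros `baseline` are hypothetical, and "removing" a feature by swapping in a baseline value is an assumption of this sketch. Exhaustive enumeration is only feasible for a handful of features, so practical tools generally rely on approximations instead.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values via brute-force enumeration of coalitions."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when it joins coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with one interaction term (x0 * x2).
def model(x):
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 3.0]   # the prediction we want to explain (assumed)
baseline = [0.0, 0.0, 0.0]   # stand-in values for "absent" features (assumed)


def value_fn(coalition):
    # Payout of a coalition: the prediction when only its features take the
    # instance's values and every other feature is set to the baseline.
    x = [instance[j] if j in coalition else baseline[j]
         for j in range(len(instance))]
    return model(x)


phi = shapley_values(value_fn, len(instance))
print(phi)
```

In this toy setup the contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved.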

Regression and Other Stories [Book PDF, Free]
Most textbooks on regression focus on theory and the simplest of examples. Real statistical problems, however, are complex and subtle. This is not a book about the theory of regression. It is about using regression to solve real problems of comparison, estimation, prediction, and causal inference. Unlike other books, it focuses on practical issues such as sample size and missing data and a wide range of goals and techniques. It jumps right in to methods and computer code you can...

Modern Robotics: Mechanics, Planning, and Control [Book PDF, Free]
This introduction to robotics offers a distinct and unified perspective of the mechanics, planning and control of robots. Ideal for self-learning, or for courses, as it assumes only freshman-level physics, ordinary differential equations, linear algebra and a little bit of computing background. Modern Robotics presents the state-of-the-art, screw-theoretic techniques capturing the most salient physical features of a robot in an intuitive geometrical way. With numerous exercises at the end of each chapter, accompanying software written to reinforce the concepts in the book and video lectures aimed at changing the classroom experience, this is the go-to textbook for learning about this fascinating subject...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
