Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
January 6, 2022

Editor's Picks

  • Managing the First Year - Thoughts on being a new data science manager
    It was during my second week that I met with my manager and understood that no, I’d been hired to replace her as the team’s manager...For the next 18 months I stayed in that role, directly managing a team of 4-8 data scientists. That time was a firehose of learning – some lessons I had sought out, and others that landed on my head without invitation...I’m not a management expert, but I did try really hard during my first year managing, and I’ve since spent time digesting the experience. My hope is that others will find a few of the things I learned useful when they’re at the start of their own management journey...
  • Chatbots: Still Dumb After All These Years
    Intelligence is more than statistically appropriate responses...I posed this commonsense question: "Is it safe to walk downstairs backwards if I close my eyes?"...Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean. GPT-3’s answer was authoritative, confusing, and contradictory...
  • Real-time machine learning: challenges and solutions
    A year ago, I wrote a post on how machine learning is going real-time. The post must have captured many data scientists’ pain points because, after the post, many companies reached out to me sharing their pain points and discussing how to move their pipelines real time...In the last year, I’ve talked to ~30 companies in different industries about their challenges with real-time machine learning. I’ve also worked with quite a few to find the solutions. This post outlines the solutions for (1) online prediction and (2) continual learning, with step-by-step use cases, considerations, and technologies required for each level...

A Message From This Week's Sponsor

Free Course: Natural Language Processing (NLP) for Semantic Search
Learn how to build semantic search applications by making machines understand language as people do. This free course covers everything you need to build state-of-the-art language models, from machine translation to question-answering, and more. Brought to you by Pinecone. Start reading now.

Data Science Articles & Videos

  • Real-World Machine Learning Research To Production [Video]
    In this talk, Austin Huang (Vice President, AI & Machine Learning, Fidelity) explains how machine learning use cases have changed - evolving from batch prediction pipelines to real-time consumers of unstructured data. These use cases have also given rise to new opportunities for innovation in model development. Whereas in the past machine learning projects were often impeded by the availability of labeled data, we share examples of programmatic data generation such as simulation and distillation. Finally, we discuss human interfaces to machine learning models - highlighting considerations such as inference latency and aligning model architectures with user experience integration...
  • Defining AI in Policy versus Practice
    With an eye towards practical working definitions and a broader understanding of positions on these issues, we survey experts and review published policy documents to examine researcher and policy-maker conceptions of AI. We find that while AI researchers favor definitions of AI that emphasize technical functionality, policy-makers instead use definitions that compare systems to human thinking and behavior...
  • Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
    Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work...
  • Neural Network From Scratch
    In this edition of Napkin Math, we'll invoke the spirit of the Napkin Math series to establish a mental model for how a neural network works by building one from scratch. In a future issue we will do napkin math on performance, as establishing the first-principle understanding is plenty of ground to cover for today!...
  • A Coding Assistant for Data Science and Machine Learning
    Since the publication and dissemination of GPT-3, coding assistants like GitHub Copilot, powered by OpenAI’s Codex API, have been on the radar of the machine learning community for quite a while. Recently, I came across a tool called Cogram, which seems to be a kind of evolution of autocompletion, specialized for data science and machine learning, that runs directly on Jupyter Notebooks. In this article, I will show you how this tool works and share a little bit of my experience with it so far, generating machine learning code on Jupyter Notebooks...
  • “My data drifted. What’s next?” How to handle ML model drift in production.
    “I have a model in production, and the data is drifting. How to react?”...This data drift might be the only signal. You are predicting something, but don’t know the facts yet. Statistical change in model inputs and outputs is the proxy. The data has shifted, and you suspect a decay in the model performance...In other cases, you can know it for sure. You can calculate the model quality or business metrics. Accuracy, mean error, fraud rates, you name it. The performance got worse, and the data is different, too...What can you do next?...Here is an introductory overview of the possible steps...
  • Bayesian Statistics Overview and your first Bayesian Linear Regression Model
    A brief recap of Bayesian Learning followed by implementation of a Bayesian Linear Regression Model on NYC Airbnb open dataset...When I first started researching about this, I had many questions like, when is it beneficial to use Bayesian, how does the output differ from its non-Bayesian counterpart (Frequentist), how to define prior distribution, are there existing libraries in python for estimating posterior distribution, etc. I attempt to answer all these questions in this post, while keeping it brief...
  • Building models in JAX - Part 1
    I am starting a whole new series of tutorials where we will learn about the existing methods of building models in JAX. In this tutorial, we are going to build an image classifier purely in JAX. Here is the list of things that we will cover in this notebook: 1) Use the Cifar-10 dataset for training the classifier, 2) Build a classifier purely in JAX using no library other than JAX, 3) Data augmentation purely in JAX, 4) Create a custom training/testing loop in the most simplified manner, and 5) Discuss the pros and cons of this approach...
  • Effective Testing for Machine Learning (Part II)
    A progressive, step-by-step framework for developing robust ML projects...In this series’s first part, we started with a simple smoke testing strategy to ensure our code runs on every git push. Then, we built on top of it to ensure that our feature generation pipeline produced data with a minimum level of quality (integration tests) and verified the correctness of our data transformations (unit tests)...Now, we’ll add more robust tests: distribution changes, ensure that our training and serving logic is consistent, and check that our pipeline produces high-quality models...
  • The Magic of Integrating Factor
    One of the many techniques for solving ordinary differential equations involves using an integrating factor. An integrating factor is a function that we multiply a differential equation with to simplify it and make it integrable. It almost appears to work like magic!...
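
As a rough illustration of the integrating factor technique described in the last item above, here is a worked example for a first-order linear equation. The specific equation is chosen here for illustration and is not taken from the linked article; p(x), q(x), and μ(x) are the usual textbook symbols.

```latex
% General form and its integrating factor
\[
  y' + p(x)\,y = q(x), \qquad \mu(x) = \exp\!\left(\int p(x)\,dx\right)
\]
% Multiplying through by \mu turns the left side into an exact derivative
\[
  \bigl(\mu(x)\,y\bigr)' = \mu(x)\,q(x)
\]
% Illustrative example: p(x) = 2, q(x) = e^{-x}
\begin{align*}
  y' + 2y &= e^{-x}, \qquad \mu(x) = e^{2x} \\
  \bigl(e^{2x}\,y\bigr)' &= e^{2x}\,e^{-x} = e^{x} \\
  e^{2x}\,y &= e^{x} + C \\
  y &= e^{-x} + C\,e^{-2x}
\end{align*}
```

Substituting back gives y' + 2y = (-e^{-x} - 2Ce^{-2x}) + (2e^{-x} + 2Ce^{-2x}) = e^{-x}, which confirms the solution.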


High quality data labeling, consistently
Edge cases are the most common challenges that ML teams face when training their AI models, making it difficult to reach 95+% accuracy. This can be more complex once you need to scale and start working with 3rd party data labeling solutions. The evaluation metrics that we use to measure the quality of labeled data - Intersection over Union (IOU) and F1 score - have allowed us to make swift adjustments on the go and continuously improve the quality of our labeling standards. To find out more and start exploring our end-to-end data labeling service, speak to the team at Supahands today.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!


Training & Resources

  • ISLR tidymodels Labs
    This book aims to be a complement to the 2nd edition of the An Introduction to Statistical Learning book, with translations of the labs into the tidymodels set of packages...The labs will be mirrored quite closely to stay true to the original material...
  • Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI
    The second edition of Deep Learning Interviews is home to hundreds of fully-solved problems, from a wide range of key topics in AI. It is designed both to rehearse interview- or exam-specific topics and to provide machine learning MSc / PhD students, and those awaiting an interview, with a well-organized overview of the field. The problems it poses are tough enough to cut your teeth on and to dramatically improve your skills, but they're framed within thought-provoking questions and engaging stories...


P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
