Data Science Weekly Newsletter

Issue 416

November 11, 2021

Editor's Picks

Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
Your data scientists produce wonderful models, but they can only deliver value once the models are integrated into your production systems...Through scouring conference talks and blog posts from the past several years, I’ve documented ML platforms’ common components and capabilities at eleven large tech companies...This post contains: a) A high-level overview of common ML platform components, b) A table of tools used by each company, c) Observations about the components, d) The platform user experience, and e) A summary of capabilities unique to certain companies...

From Data Engineer to SysAdmin: Put down the K8s cluster, your pipelines can run without it
I’ve been operating Kubernetes (using EKS) in a data engineering team for almost three years now, and I’d be wary of using it if I had the choice in the future. This isn’t an anti-Kubernetes post, as I think Kubernetes (K8s) is a game-changing technology and would bet that our team’s investment in K8s will pay off over the longer term as our engineering headcount grows past one thousand. This post is a ‘be careful, you are not Google’ type post, with some specifics on how K8s ownership has proved frustrating and unsatisfying to an engineer whose actual goal is to help the business understand itself and build better products using data & ML...

GPT-3 is No Longer the Only Game in Town
GPT-3 was by far the largest AI model of its kind last year. Now? Not so much...the ability of people to build upon GPT-3 was hampered by one major factor: it was not publicly released...So, since last year, multiple organizations have worked towards creating their own version of GPT-3, and as I’ll go over in this article, at this point roughly half a dozen such gigantic GPT-3-esque models have been developed (though, as with GPT-3, not yet publicly released)...

A Message From This Week's Sponsor

Pull data at any scale from your data warehouse
PostHog is an open-source product analytics platform that can ingest data at any scale, even from data warehouses based on BigQuery, Snowflake, S3 or Redshift.

Once your data is in PostHog you can analyse it using funnels, trends, pathing visualizations and more. You can even integrate with other platforms, creating a data pipeline for ongoing analysis.

Best of all, you can deploy PostHog on your own infrastructure in minutes. Deploy PostHog today for free.

Data Science Articles & Videos

The Turing Test Is Bad For Business
Fears of artificial intelligence fill the news...The one group everyone assumes will benefit is business, but the data seems to disagree. Amid all the hype, US businesses have been slow in adopting the most advanced AI technologies, and there is little evidence that such technologies are contributing significantly to productivity growth or job creation...Turing himself, and other technology pioneers such as Douglas Engelbart and Norbert Wiener, understood that computers would be most useful to business and society when they augmented and complemented human capabilities, not when they competed directly with us...

The difference between outlier detection and data drift detection
When monitoring ML models in production, we can apply different techniques...Data drift and outlier detection are among those. Both are useful when we do not have ground truth labels yet. The data is then the only thing to look at...There are various statistical approaches to detect either (an interesting discussion by itself!), but also a principal difference...

Improving a Machine Learning System (Part 1 - Broken Abstractions)
Suppose you have been hired to apply state of the art machine learning technology to improve the Foo vs Bar classifier at FooBar International. Foo vs Bar classification is a critical business need for FooBar International, and the company has been using a simple system based on a decade-old machine learning technology to solve this problem for the last several years...To your surprise, your new model substantially underperforms compared to the existing system...This is a familiar story that anybody who has built machine learning models at a large company will recognize. Making measurable improvements to a mature machine learning system is extremely difficult. In this post, we will explore why...

Machine Learning from a Bayesian Perspective [PDF]
I summarize a Bayesian perspective of machine learning. We view Bayes as an optimization problem whose solutions use the information-geometry of the posterior. Using this perspective, we can show that many machine-learning methods have a (more general) Bayesian side to them. I believe this perspective to be essential for bridging the gap between ‘artificial’ and ‘natural’ learning systems...

DALL·E mini: Zero-Shot Text-to-Image Generation [Video]
The ability to control image generation with natural language is very fascinating and opens a lot of new opportunities in the field of multimodal machine learning. OpenAI's recent blog about their DALL·E project shows the potential of models, but unfortunately, the model has not been released...Our goal here with DALL·E mini is to show that one can still achieve reasonable performance on this multimodal task with far more accessible means of compute. Even though DALL·E mini is about 30 times smaller than the original and trained on a much smaller dataset, it demonstrates interesting zero-shot capabilities...In this talk, we will get to know DALL·E mini in detail, and explain how it is capable of achieving such results thanks to the use of pre-trained models such as the VQ-GAN and BART. We will dig deeper into the theoretical aspects of these models to understand what happens under the hood in the DALL·E mini pipeline...

Updates and Lessons from AI Forecasting
Earlier this year, my research group commissioned 6 questions for professional forecasters to predict about AI. Broadly speaking, 2 were on geopolitical aspects of AI and 4 were on future capabilities...My overall take from this task and the previous one is that forecasters are pretty confident that we won't have the singularity before 2025, but at the same time there will be demonstrated progress in ML that I would expect to convince a significant fraction of skeptics (in the sense that it will look untenable to hold positions that "Deep learning can't do X")...

An Introduction to Language Models in NLP (Part 1: Intuition)
This post provides an overview of a couple of key concepts surrounding language models: a) We define a language model as an algorithm that scores how "human" a sentence is, b) We describe a way to train language models: by observing language and turning these observations into probabilities, and c) We discuss a couple of approaches to evaluating the quality of language models: human evaluation (did the robot responses sound natural to a human?), downstream tasks (did the robot responses lead to actual food?), and intrinsic evaluations (how perplexed were the robots by the human utterances?)...
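To make "turning observations into probabilities" concrete, here is a rough sketch of our own (not code from the post): a bigram model with add-one smoothing, scored by perplexity. The toy corpus, the smoothing choice, and the function names train_bigram and perplexity are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous, current) pairs
        vocab.update(tokens[1:])
    return contexts, bigrams, vocab

def perplexity(sentence, contexts, bigrams, vocab):
    """Perplexity of one sentence under add-one (Laplace) smoothing."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    v = len(vocab) + 1  # assumption: reserve one extra slot for unseen words
    log_prob = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

# Hypothetical toy corpus, echoing the post's food-ordering theme.
corpus = ["i want pizza", "i want sushi", "you want pizza"]
contexts, bigrams, vocab = train_bigram(corpus)
natural = perplexity("i want pizza", contexts, bigrams, vocab)
scrambled = perplexity("pizza want i", contexts, bigrams, vocab)
print(f"natural: {natural:.2f}, scrambled: {scrambled:.2f}")
```

Lower perplexity means the model is less "perplexed" by the sentence, so the familiar word order should score lower than the scrambled one.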

Gradients are Not All You Need
Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos-based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation-based optimization algorithms...
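As a small illustration of why the Jacobian matters here, the sketch below differentiates through many steps of an iterated map by multiplying per-step Jacobians (the chain rule). The logistic map is our own stand-in toy system, not an example taken from the report, and the function name and parameter values are assumptions.

```python
def logistic_gradient(r, x0, steps):
    """Return d x_steps / d x_0 for the map x_{t+1} = r * x_t * (1 - x_t)."""
    x, grad = x0, 1.0
    for _ in range(steps):
        grad *= r * (1.0 - 2.0 * x)  # 1x1 Jacobian of a single step
        x = r * x * (1.0 - x)        # advance the system
    return grad

# r = 2.5 settles to a fixed point where |Jacobian| < 1, so the gradient shrinks.
stable_grad = logistic_gradient(r=2.5, x0=0.2, steps=200)
# r = 3.9 is in the chaotic regime, where the product tends to grow exponentially.
chaotic_grad = logistic_gradient(r=3.9, x0=0.2, steps=200)

print(f"stable:  |grad| = {abs(stable_grad):.3e}")
print(f"chaotic: |grad| = {abs(chaotic_grad):.3e}")
```

The same product-of-Jacobians structure appears when backpropagating through unrolled recurrent networks or simulators, which is where gradients that are too small or too large to be useful tend to show up.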

Clarity and Aesthetics in Data Visualization: Guidelines
We built an initial set of guidelines that are based on two elements. First, they come from observing actual problems we found over and over again in the solutions submitted to the mini-projects. In this sense the guidelines just emerged from practice. Second, they come from trying to justify our intuitions on notions of visual perception. In this sense the guidelines also rest on considerations stemming from visual perception. In this post I am going to focus on the guidelines...

AI Experts Establish the “North Star” for Domestic Robotics Field
A Stanford AI team creates benchmarks for 100 everyday household tasks for robot assistants, creating a path for more useful agents...

Tools

Create AI-powered search and recommendation apps with Pinecone
Pinecone is a fully managed vector database that makes it easy to add vector search to production applications. It combines state-of-the-art vector search libraries, advanced features such as filtering, and distributed infrastructure to provide high performance and reliability at any scale. Get started now. It's free!
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Entry Level Data Scientist: 2022 - IBM - Multiple Locations
As a Data Scientist at IBM, you will help transform our clients’ data into tangible business value by analyzing information, communicating outcomes and collaborating on product development. Work with Best in Class open source and visual tools, along with the most flexible and scalable deployment options. Whether it’s investigating patient trends or weather patterns, you will work to solve real world problems for the industries transforming how we live.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

The Ancient Secrets of Computer Vision - An Introduction to Computer Vision
This class is a general introduction to computer vision. It covers standard techniques in image processing like filtering, edge detection, stereo, flow, etc. (old-school vision), as well as newer, machine-learning based computer vision. It was originally offered in the spring of 2018 at the University of Washington...

Deep Learning With PyTorch - 5 Hour Full YouTube Course
In this course you will learn all the fundamentals to get started with PyTorch and Deep Learning: a) Intro, b) Installation, c) Tensor Basics, d) Autograd, e) Backpropagation, f) Gradient Descent, g) Training Pipeline, h) Linear Regression, i) Logistic Regression, j) Dataset and Dataloader, k) Dataset Transforms, l) Softmax and Crossentropy, m) Activation Functions, n) Feed Forward Net, o) CNN, p) Transfer Learning, q) Tensorboard, and r) Save & Load Models...

How to create a Hex Tile Grid Map in Excel
In a previous blog post I showed you how to build a Grid Map with circles using Excel charting capability. In this blog post I’m going to start off from where we left it and use the same data and graph to transform it into the hex tile grid map—as per the below graph showing the US Death Penalty Status in 2020...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
