Data Science Weekly Newsletter

Issue 438

April 14, 2022

Editor's Picks

The Modern Data Stack Ecosystem: Spring 2022 Edition
In this article, we’ll provide an in-depth look at the Modern Data Stack (MDS) ecosystem...We put together this article to highlight the most crucial components of the MDS and the main tools and vendors in each component. This list is not meant to be exhaustive; instead, we’re focused on compiling feedback and observations from our work with MDS customers to provide a CliffsNotes-style guide that should help people exploring the MDS to understand where to start...

What's next for AlphaFold and the AI protein-folding revolution
DeepMind software that can predict the 3D shape of proteins is already changing biology...

Playing with DALL·E 2
I got access to DALL·E 2 yesterday. Here are some pretty pictures!...My goal was to try to understand what things DE2 could do well, and what things it had trouble understanding or generating. My general hypothesis is that it would do a better job with things that are easy to find on the internet (cute animals, digital scifi things, famous art) and less well with more abstract or more unusual things...Here's how it works...

A Message From This Week's Sponsor

Registrations open for apply(), the ML data engineering conference
Sign up for free and tune in on May 18-19. Data and ML teams will come together to discuss the practical data engineering challenges of Operational ML. Agenda highlights:

Curated talks to hear from the best ML thought leaders and practitioners (from organizations like Twitter, Instacart, Stripe, Uber, Walmart, Faire, Snapchat, Wikimedia...)
Hands-on workshop to see emerging MLOps tools in action
Virtual networking with the speakers and your peers
In-person meetups in NYC and SF to connect with the community!

See the full agenda and register for free.

Data Science Articles & Videos

The end of Big Data
Databricks, Snowflake, and the end of an overhyped era...Over time, I expect the rest of the industry to follow Snowflake and dbt away from its history of technological hype and up the “slope of enlightenment.” For most of us, this looks like embracing what we consider pedestrian, talking about problems and not technologies, and listening to a wider array of professional voices...

Large-Scale Matrix Factorization on TPUs
Matrix factorization is one of the oldest, yet still widely used, techniques for learning how to recommend items such as songs or movies from user ratings. In its basic form, it approximates a large, sparse (i.e., mostly empty) matrix of user-item interactions with a product of two smaller, denser matrices representing learned item and user features...we explore a distributed ALS design that makes efficient use of the TPU architecture and can scale well to matrix factorization problems of the order of billions of rows and columns by scaling the number of available TPU cores...
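To make the basic idea concrete, here is a minimal sketch of matrix factorization on a toy rating list. It is an illustration only: it uses plain stochastic gradient descent rather than the distributed ALS design the article describes, and the names `factorize` and `mse`, the toy ratings, and all hyperparameters are assumptions made for this example.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, steps=2000,
              lr=0.02, reg=0.01, seed=0):
    """Learn user factors U and item factors V so that dot(U[u], V[i])
    approximates each observed rating. Uses SGD, not ALS (assumption)."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(U[u], V[i]))
            err = r - pred
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def mse(ratings, U, V):
    """Mean squared error over the observed (user, item, rating) triples."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(a * b for a, b in zip(U[u], V[i]))
        total += (r - pred) ** 2
    return total / len(ratings)

# Hypothetical sparse interactions: (user index, item index, rating).
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]

U0, V0 = factorize(toy_ratings, 3, 3, steps=0)   # untrained factors
U1, V1 = factorize(toy_ratings, 3, 3)            # trained factors
error_before = mse(toy_ratings, U0, V0)
error_after = mse(toy_ratings, U1, V1)
```

Only the observed entries contribute to the loss, which is what lets the two small dense factor matrices stand in for a large, mostly empty interaction matrix.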

How to Structure a Data Science Project for Readability and Transparency
It is important to structure your data science project based on a certain standard so that your teammates can easily maintain and modify your project...Wouldn’t it be nice if you could create an ideal structure for a data science project using a template?...That is why I created a repository named data-science-template. This repository is the result of my years refining the best way to structure a data science project so that it is reproducible and maintainable...

The counter-intuitive rise of Python in scientific computing
In our laboratory, a polarizing debate has raged since around 2010, summarized by this question: "Why are more and more time-critical scientific computations formerly performed in Fortran now written in Python, a slower language?"...The terms are vague, encouraging tribal wars between users based more on their habits than on objective assessments about the two approaches. Let’s try to give some elements to reach a mutual understanding, by narrowing the question...

Improving Code Reviews with GitHub’s Copilot
In this episode, I talk to Paige Bailey, the director of machine learning and machine learning operations, aka MLOps, at GitHub...We talk about: a) How to help data scientists review their code, b) Using GitHub Copilot to write and understand code, c) How machine learning can help improve code reviews, and d) How to find security vulnerabilities automatically in code...

You Should Use This to Visualize SQL Joins Instead of Venn Diagrams
A couple of weeks ago I published an article about SQL Anti-Joins on Reddit...Not long after I shared it, I got this response: "Please stop using Venn Diagrams. They confuse people more than they help."...This piqued my interest since I hadn’t read or heard of anyone who thought Venn diagrams were a bad way to visualize SQL joins up to this point...I decided to write this article after I thought about the arguments from both sides and then found what I think is an underrated visualization for SQL joins that I am calling the checkered flag diagram...

Machine Learning State-of-the-Art with Uncertainties
With the availability of data, hardware, software ecosystem and relevant skill sets, the machine learning community is undergoing rapid development, with new architectures and approaches appearing at high frequency every year. In this article, we conduct an exemplary image classification study in order to demonstrate how confidence intervals around accuracy measurements can greatly enhance the communication of research results as well as impact the reviewing process...
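As a rough illustration of what a confidence interval around an accuracy measurement looks like, the sketch below uses the textbook normal-approximation (Wald) interval for a binomial proportion. This is an assumption chosen for brevity and not necessarily the method used in the article; the function name `accuracy_interval` and the example counts are hypothetical.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation interval for accuracy = correct / total.
    z=1.96 corresponds to roughly 95% coverage."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clip to the valid range for a proportion.
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical result: 900 of 1000 test images classified correctly.
low, high = accuracy_interval(900, 1000)
print(f"accuracy 0.900, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

With 1000 test images the interval is roughly plus or minus two percentage points, so two models whose reported accuracies differ by less than that may not be meaningfully different on this test set.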

Teaching visualization
I had been teaching visualization for years, to computer science students, informatics students, and occasionally journalism students, and recently overhauled how I do it curriculum-wise, including to focus a little more on ‘visualizations as model checks’ and visualization for decision making...

Compute Funds and Pre-trained Models
The US National AI Research Resource should provide structured access to models, not just data and compute...This post, authored by Markus Anderljung, Lennart Heim, and Toby Shevlane, argues that a newly proposed US government institution has an opportunity to support “structured access” to large AI models...

The Case for Dataset-Centric Visualization
Different BI tools offer different approaches to building dashboards. On one end of the spectrum, you have tools that prescribe having one query per chart and on the other end you have tools that espouse implementing a complex semantic layer. I believe there's a middle path that lies between both extremes, and I call it the dataset-centric approach...

Build Data Factories, Not Data Warehouses
The data warehouse is a broken metaphor in the modern data stack...We aren’t loading indistinguishable pallets of data into virtual warehouses, where we stack them in neat rows and columns and then forklift them out onto delivery trucks...Instead, we feed raw data into factories filled with complex assembly lines connected by conveyor belts. Our factories manufacture customized and evolving data products for various internal and external customers...

Summit

You're invited to the first-ever Metrics Store Summit
Transform is hosting the first-ever industry summit on the metrics layer. The Metrics Store Summit on April 26, 2022 will bring discussions around the semantic layer into one event, providing context with use cases for metrics stores, highlighting applications for metrics, and sharing ideas from leaders across the modern data stack. You can expect to hear from Airbnb, Slack, Spotify, Atlan, Hex, Mode, Hightouch, AtScale and many more in this action-packed 1-day event. We would love to see you there! Register today for free.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist - Hungryroot - Remote
Hungryroot is looking for a Data Scientist to join our growing Data Team. As a Data Scientist, you will work closely with other Data Scientists and Data Engineers to develop various Machine Learning models that power Hungryroot and its AI functions. These models include traditional forecasting models, as well as more industry-specific optimization challenges.

As a Data Scientist at Hungryroot, you will work on answering questions like: How do you tell what food someone would like to eat this week? How do you determine whether they enjoyed it or not? Maybe they liked their meals last week but are now looking for different options. Maybe they like the same food on Tuesdays but variety on Fridays. What about spicy food: is Green Chili as spicy as Green Curry?

Training & Resources

Scruff.jl
Scruff is an AI framework to build agents that sense, reason, and learn in the world using a variety of models. It aims to integrate many different kinds of models in a coherent framework, provide flexibility in spatiotemporal modeling, and provide tools to compose, share, and reuse models and model components...

Understand Machine Learning Through 7 Software Design Patterns
Objects, interfaces, classes, and inheritance are common concepts in the world of Object-Oriented-based software. What is always challenging is to think of different and creative ways to build flexible, reusable software components...Design Patterns are made up for this purpose. They tend to create structures and reuse successful designs and architectures...I am attempting to implement several design patterns while imagining small scenarios in the realm of Machine Learning Development...

An Introduction to Exceedance Probability Forecasting
Exceedance probability forecasting is the problem of estimating the probability that a time series will exceed a predefined threshold in a predefined future period...In a previous post I briefly described 6 problems that arise with time series data, including exceedance probability forecasting. Here I will dive deeper into this task. After some basic definitions I will explain why this problem matters, and how you can apply it in your own time series using Python...
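One simple way to read that definition, sketched here under stated assumptions: given a set of simulated future paths from any forecasting model, estimate the exceedance probability as the share of paths that cross the threshold at least once within the horizon. The function name `exceedance_probability` and the sample paths are hypothetical, and this is not necessarily the approach taken in the linked post.

```python
def exceedance_probability(forecast_paths, threshold):
    """Share of simulated future paths whose maximum over the forecast
    horizon exceeds the threshold (an empirical estimate)."""
    if not forecast_paths:
        raise ValueError("need at least one simulated path")
    exceeding = sum(1 for path in forecast_paths if max(path) > threshold)
    return exceeding / len(forecast_paths)

# Hypothetical ensemble: four simulated paths over a three-step horizon.
paths = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0],
    [4.0, 0.0, 0.0],
    [2.0, 2.0, 2.0],
]
prob = exceedance_probability(paths, threshold=2.5)
```

Here two of the four paths rise above 2.5 at some point in the horizon, so the estimate is 0.5; a real ensemble would need many more paths for a stable estimate.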

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
