Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
April 14, 2022

Editor's Picks

  • The Modern Data Stack Ecosystem: Spring 2022 Edition
    In this article, we’ll provide an in-depth look at the Modern Data Stack (MDS) ecosystem...We put together this article to highlight the most crucial components of the MDS and the main tools and vendors in each component. This list is not meant to be exhaustive; instead, we’re focused on compiling feedback and observations from our work with MDS customers to provide a CliffsNotes-style guide that should help people exploring the MDS to understand where to start...
  • Playing with DALL·E 2
    I got access to DALL·E 2 yesterday. Here are some pretty pictures!...My goal was to try to understand what things DE2 could do well, and what things it had trouble understanding or generating. My general hypothesis is that it would do a better job with things that are easy to find on the internet (cute animals, digital scifi things, famous art) and less well with more abstract or more unusual things...Here's how it works...

A Message From This Week's Sponsor

Registrations open for apply(), the ML data engineering conference
Sign up for free and tune in on May 18-19. Data and ML teams will come together to discuss the practical data engineering challenges of Operational ML. Agenda highlights:
  • Curated talks to hear from the best ML thought leaders and practitioners (from organizations like Twitter, Instacart, Stripe, Uber, Walmart, Faire, Snapchat, Wikimedia...)
  • Hands-on workshop to see emerging MLOps tools in action
  • Virtual networking with the speakers and your peers
  • In-person meetups in NYC and SF to connect with the community!

See the full agenda and register for free.

Data Science Articles & Videos

  • The end of Big Data
    Databricks, Snowflake, and the end of an overhyped era...Over time, I expect the rest of the industry to follow Snowflake and dbt away from its history of technological hype and up the “slope of enlightenment.” For most of us, this looks like embracing what we consider pedestrian, talking about problems and not technologies, and listening to a wider array of professional voices...
  • Large-Scale Matrix Factorization on TPUs
    Matrix factorization is one of the oldest, yet still widely used, techniques for learning how to recommend items such as songs or movies from user ratings. In its basic form, it approximates a large, sparse (i.e., mostly empty) matrix of user-item interactions with a product of two smaller, denser matrices representing learned item and user features...we explore a distributed ALS design that makes efficient use of the TPU architecture and can scale well to matrix factorization problems of the order of billions of rows and columns by scaling the number of available TPU cores...
  • How to Structure a Data Science Project for Readability and Transparency
    It is important to structure your data science project based on a certain standard so that your teammates can easily maintain and modify your project...Wouldn’t it be nice if you could create an ideal structure for a data science project using a template?...That is why I created a repository named data-science-template. This repository is the result of my years refining the best way to structure a data science project so that it is reproducible and maintainable...
  • The counter-intuitive rise of Python in scientific computing
    In our laboratory, a polarizing debate has raged since around 2010, summarized by this question: "Why are more and more time-critical scientific computations formerly performed in Fortran now written in Python, a slower language?"...The terms are vague, encouraging tribal wars between users based more on their habits than on objective assessments of the two approaches. Let’s try to give some elements to reach a mutual understanding, by narrowing the question...
  • Improving Code Reviews with Github’s Copilot
    In this episode, I talk to Paige Bailey, the director of machine learning and machine learning operations, aka MLOps, at GitHub...We talk about: a) How to help data scientists review their code, b) Using GitHub Copilot to write and understand code, c) How machine learning can help improve code reviews, and d) How to find security vulnerabilities automatically in code...
  • You Should Use This to Visualize SQL Joins Instead of Venn Diagrams
    A couple weeks ago I published an article about SQL Anti-Joins on Reddit...Not long after I shared it, I got this response: "Please stop using Venn Diagrams. They confuse people more than they help."...This piqued my interest since I hadn’t read or heard of anyone who thought Venn diagrams were a bad way to visualize SQL joins up to this point...I decided to write this article after I thought about the arguments from both sides and then found what I think is an underrated visualization for SQL joins that I am calling the checkered flag diagram...
  • Machine Learning State-of-the-Art with Uncertainties
    With the availability of data, hardware, software ecosystem and relevant skill sets, the machine learning community is undergoing rapid development, with new architectures and approaches appearing at high frequency every year. In this article, we conduct an exemplary image classification study in order to demonstrate how confidence intervals around accuracy measurements can greatly enhance the communication of research results as well as impact the reviewing process...
  • Teaching visualization
    I had been teaching visualization for years, to computer science students, informatics students, and occasionally journalism students, and recently overhauled how I do it curriculum-wise, including to focus a little more on ‘visualizations as model checks’ and visualization for decision making...
  • Compute Funds and Pre-trained Models
    The US National AI Research Resource should provide structured access to models, not just data and compute...This post, authored by Markus Anderljung, Lennart Heim, and Toby Shevlane, argues that a newly proposed US government institution has an opportunity to support “structured access” to large AI models...
  • The Case for Dataset-Centric Visualization
    Different BI tools offer different approaches to building dashboards. On one end of the spectrum, you have tools that prescribe having one query per chart and on the other end you have tools that espouse implementing a complex semantic layer. I believe there's a middle path that lies between both extremes, and I call it the dataset-centric approach...
  • Build Data Factories, Not Data Warehouses
    The data warehouse is a broken metaphor in the modern data stack...We aren’t loading indistinguishable pallets of data into virtual warehouses, where we stack them in neat rows and columns and then forklift them out onto delivery trucks...Instead, we feed raw data into factories filled with complex assembly lines connected by conveyor belts. Our factories manufacture customized and evolving data products for various internal and external customers...
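
The matrix factorization item above describes approximating a sparse user-item matrix with a product of two smaller matrices, fitted by alternating least squares (ALS). As a rough illustration only, and not the distributed TPU design from the linked article, here is a minimal single-machine sketch restricted to rank 1, so that each alternating update is a scalar ridge solve. The function names, the (user, item, rating) triple format, and the default parameters are all assumptions made for this example.

```python
def als_rank1(ratings, n_users, n_items, reg=1e-6, n_iters=20):
    """Fit rating ~= u[user] * v[item] on observed (user, item, rating) triples.

    Hypothetical helper: rank 1 only, pure Python, no distribution.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # Hold item factors fixed and solve for every user factor, then swap.
        u = _solve_side(ratings, v, n_users, reg, user_side=True)
        v = _solve_side(ratings, u, n_items, reg, user_side=False)
    return u, v


def _solve_side(ratings, fixed, n, reg, user_side):
    """Closed-form ridge update for one side while the other side is fixed."""
    num = [0.0] * n
    den = [reg] * n  # reg > 0 also avoids division by zero for unseen rows
    for i, j, r in ratings:
        a, b = (i, j) if user_side else (j, i)
        num[a] += r * fixed[b]
        den[a] += fixed[b] ** 2
    return [x / d for x, d in zip(num, den)]


def objective(ratings, u, v, reg):
    """Regularized squared error over the observed entries only."""
    err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))
```

Only observed entries enter the sums, which is what lets the same update work on a mostly empty matrix. A prediction for an unobserved pair (i, j) is u[i] * v[j]. A practical recommender would use a higher rank, which turns each scalar division into a small linear solve per row.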


You're invited to the first-ever Metrics Store Summit
Transform is hosting the first-ever industry summit on the metrics layer. The first-ever Metrics Store Summit on April 26, 2022 will bring discussions around the semantic layer into one event, providing context with use cases for metrics stores, highlighting applications for metrics, and sharing ideas from leaders across the modern data stack. You can expect to hear from Airbnb, Slack, Spotify, Atlan, Hex, Mode, Hightouch, AtScale and many more in this action-packed 1-day event. We would love to see you there! Register today for free.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!


Training & Resources

  • Scruff.jl
    Scruff is an AI framework to build agents that sense, reason, and learn in the world using a variety of models. It aims to integrate many different kinds of models in a coherent framework, provide flexibility in spatiotemporal modeling, and provide tools to compose, share, and reuse models and model components...
  • Understand Machine Learning Through 7 Software Design Patterns
    Objects, interfaces, classes, and inheritance are common concepts in the world of object-oriented software. What is always challenging is to think of different and creative ways to build flexible, reusable software components...Design patterns exist for this purpose. They tend to create structures and reuse successful designs and architectures...I am attempting to implement several design patterns while imagining small scenarios in the realm of Machine Learning Development...
  • An Introduction to Exceedance Probability Forecasting
    Exceedance probability forecasting is the problem of estimating the probability that a time series will exceed a predefined threshold in a predefined future period...In a previous post I briefly described 6 problems that arise with time series data, including exceedance probability forecasting. Here I will dive deeper into this task. After some basic definitions I will explain why this problem matters, and how you can apply it in your own time series using Python....
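
To make the exceedance probability definition above concrete, here is a minimal sketch of the simplest possible baseline: an unconditional empirical base rate, meaning the fraction of historical windows of the chosen length in which the series went above the threshold. It assumes the series is roughly stationary, and it is not the method from the linked post. The function name and arguments are hypothetical.

```python
def exceedance_base_rate(series, threshold, horizon):
    """Fraction of length-`horizon` windows whose maximum exceeds `threshold`.

    Illustrative baseline only: it ignores recent values, so it gives the
    same probability no matter when the forecast is made.
    """
    if horizon < 1 or horizon > len(series):
        raise ValueError("horizon must be between 1 and len(series)")
    n_windows = len(series) - horizon + 1
    hits = 0
    for start in range(n_windows):
        window = series[start:start + horizon]
        if max(window) > threshold:
            hits += 1
    return hits / n_windows


# Example: windows [1,5], [5,2], [2,3], [3,7], [7,1]; four of five exceed 4.
print(exceedance_base_rate([1, 5, 2, 3, 7, 1], threshold=4, horizon=2))  # 0.8
```

A real forecaster would condition on recent observations, for instance by fitting a classifier on lagged values or by counting exceedances across an ensemble of forecast paths. The base rate is still a useful reference point to compare such models against.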


P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
