Data Science Weekly Newsletter

Issue 439

April 21, 2022

Editor's Picks

Real World Recommendation System - Part 1
Training a collaborative filtering based recommendation system on a toy dataset is a sophomore year project in colleges these days. But where the rubber meets the road is building such a system at scale, deploying in production, and serving live requests within a few hundred milliseconds while the user is waiting for the page to load. To build a system like this, engineers have to make decisions spanning multiple moving layers like...

Advances in Understanding, Improving, and Applying Contrastive Learning
Contrastive learning has emerged as a powerful method for training ML models. In this series of three blog posts, we’ll discuss recent advances in understanding the mechanisms behind contrastive learning. We’ll see how we can use those insights to get better learned representations out of supervised contrastive learning, and see how we can apply contrastive learning to improve long-tailed entity retrieval...

ICLR Conference's First Blogpost Track Experiment was a great success with 20 accepted posts
Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers to review past work and summarize the outcomes, develop new intuitions, or highlight some shortcomings...Here are the 20 posts...

A Message From This Week's Sponsor

Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.

Data Science Articles & Videos

Faking It: How to Simulate Complex Data Generation Processes in R, Tidyverse Edition
Data simulation is easily near the top of the long list of useful skills that are seldom taught in social science graduate programs. This is unfortunate given the central role of simulation in model checking, sensitivity analysis, and developing a basic understanding of modeling assumptions and often complex relationships between the phenomena social scientists aspire to understand. My aim in this blog post is thus to provide a basic introduction to data simulation and parameter recovery in R for cross-sectional time series and non-nested data structures commonly encountered in political science and international relations...

Ever wondered how the probability of the null hypothesis being true changes given a significant result?
In a recently accepted paper...we discuss how, using Bayes' rule, one can explore the change in the probability of a null hypothesis being true (call it theta) when you get a significant effect. The paper...shows that theta does not necessarily change much even if you get a significant result. The probability theta can change dramatically under certain conditions, but those conditions are either so stringent or so trivial that it renders many of the significance-based conclusions in psychology and psycholinguistics questionable at the very least...You can do your own simulations...using this shiny app below...
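
As a rough illustration of the Bayes' rule argument (a sketch, not the paper's own code or its shiny app): assume a significant result occurs with probability alpha when the null is true and with probability equal to the test's power when it is false. The function name `posterior_null` and the example numbers are hypothetical, chosen only for illustration.

```python
def posterior_null(prior_null, alpha, power):
    """Probability that the null is true given a significant result.

    Assumes a simple two-hypothesis setup (an assumption of this sketch):
      P(significant | null true)  = alpha
      P(significant | null false) = power
    Bayes' rule then gives
      P(null | significant) = alpha * theta / (alpha * theta + power * (1 - theta))
    where theta is the prior probability that the null is true.
    """
    for name, value in (("prior_null", prior_null), ("alpha", alpha), ("power", power)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")

    significant_and_null = alpha * prior_null
    significant_and_alt = power * (1.0 - prior_null)
    total = significant_and_null + significant_and_alt
    if total == 0.0:
        raise ValueError("a significant result has zero probability under these inputs")
    return significant_and_null / total


if __name__ == "__main__":
    # Hypothetical scenarios: (prior probability of the null, alpha, power)
    scenarios = [(0.5, 0.05, 0.8), (0.9, 0.05, 0.2)]
    for theta, alpha, power in scenarios:
        post = posterior_null(theta, alpha, power)
        print(f"prior={theta:.2f} alpha={alpha:.2f} power={power:.2f} -> posterior={post:.3f}")
```

Under these assumptions, a well-powered test with an even prior drops theta from 0.5 to about 0.06, while a low-powered test with a high prior only moves it from 0.9 to about 0.69.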

All the talks and the Q&As from the #Outlier2022 Data Viz Conference
All the curated talks, lightning talks and the Q&As from the 2022 edition of the Outlier conference...For all the #dataviz enthusiasts out there. Bookmark this playlist by the @DataVizSociety and @OutlierConf. It contains all the curated talks, lightning talks and the Q&As from #Outlier2022! #datajournalism...

Bad ML Abstractions I (Generative vs Discriminative Models)
This post is part of a series on bad abstractions in machine learning...Bad Abstraction: There are two types of machine learning models. Discriminative models are trained to separate inputs into classes, while generative models learn a distribution from which they can draw new samples...These two categories are not actually distinct...

A Robot Web for Distributed Many-Device Localisation
We show that a distributed network of robots or other devices which make measurements of each other can collaborate to globally localise via efficient ad-hoc peer to peer communication. Our Robot Web solution is based on Gaussian Belief Propagation on the fundamental non-linear factor graph describing the probabilistic structure of all of the observations robots make internally or of each other, and is flexible for any type of robot, motion or sensor...

Probability Distributions To Be Aware Of For Data Science (With Code)
Knowing the distribution of data helps us better model the world around us. It helps us to determine the likeliness of various outcomes, or make an estimate of the variability of an occurrence. All of this makes knowing different probability distributions extremely valuable in data science & machine learning...In this article, we are going to cover a few distributions and share some Python code to display them visually...

A Tour of Visualization Techniques for Computer Vision Datasets
We survey a number of data visualization techniques for analyzing Computer Vision (CV) datasets. These techniques help us understand properties and latent patterns in such data, by applying dataset-level analysis. We present various examples of how such analysis helps predict the potential impact of the dataset properties on CV models and informs appropriate mitigation of their shortcomings. Finally, we explore avenues for further visualization techniques of different modalities of CV datasets as well as ones that are tailored to support specific CV tasks and analysis needs...

A Physicist’s View: The Thermodynamics of Machine Learning
Complex systems are ubiquitous in nature, and physicists have found great success using thermodynamics to study these systems. Machine learning can be very complex, so can we use thermodynamics to understand it?...

On NYT Magazine on AI: Resist the Urge to be Impressed
On April 15, 2022, Steven Johnson published a piece in the New York Times Magazine entitled “A.I. Is Mastering Language. Should We Trust What It Says?”...Emily M. Bender, Professor, Linguistics, University of Washington, unpacks a recent NYT Magazine article on the future of AI and language models...

The Distributed Information Bottleneck reveals the explanatory structure of complex systems
The fruits of science are relationships made comprehensible, often by way of approximation. While deep learning is an extremely powerful way to find relationships in data, its use in science has been hindered by the difficulty of understanding the learned relationships. The Information Bottleneck (IB) is an information theoretic framework for understanding a relationship between an input and an output in terms of a trade-off between the fidelity and complexity of approximations to the relationship. Here we show that a crucial modification -- distributing bottlenecks across multiple components of the input -- opens fundamentally new avenues for interpretable deep learning in science...

Comprehensive Guide to GitHub for Data Scientists
The purpose behind this article is to give data scientists / analysts (or any non-engineering-focused individual) the rundown on how to use GitHub and what best practices to adhere to. The tutorial will consist of a combination of guidelines using the UI and command line (terminal). The naming convention for Git commands is consistent across the platforms provided by GitHub, so the skills should be transferable if you prefer to use GitHub Desktop or GitLab instead of the web UI or command line. The following is the outline for the article...

Summit

You're invited to the first-ever Metrics Store Summit
Transform is hosting the first-ever industry summit on the metrics layer. The Metrics Store Summit on April 26, 2022 will bring discussions around the semantic layer into one event, providing context with use cases for metrics stores, highlighting applications for metrics, and sharing ideas from leaders across the modern data stack. You can expect to hear from Airbnb, Slack, Spotify, Atlan, Hex, Mode, Hightouch, AtScale and many more in this action-packed one-day event. We would love to see you there! Register today for free.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist - Hungryroot - Remote

Hungryroot is looking for a Data Scientist to join our growing Data Team. As a Data Scientist, you will work closely with other Data Scientists and Data Engineers to develop various Machine Learning models that power Hungryroot and its AI functions. These models include traditional forecasting models, as well as more industry-specific optimization challenges.

As a Data Scientist at Hungryroot, you will work on answering questions like: How do you tell what food someone would like to eat this week? How do you determine whether they enjoyed it or not? Maybe they liked their meals last week but are now looking for different options. Maybe they like the same food on Tuesdays but variety on Fridays. And what about spicy food: is Green Chili as spicy as Green Curry?

Training & Resources

R Graphics Cookbook, 2nd edition
Welcome to the R Graphics Cookbook, a practical guide that provides more than 150 recipes to help you generate high-quality graphs quickly, without having to comb through all the details of R’s graphing systems. Each recipe tackles a specific problem with a solution you can apply to your own project, and includes a discussion of how and why the recipe works...Read online here for free, or buy a physical copy...

What are Diffusion Models? [Video]
This short tutorial covers the basics of diffusion models, a simple yet expressive approach to generative modeling. They've been behind a recent string of impressive results, including OpenAI's DALL-E 2...

A *simple* introduction to ggplot2 (for plotting your data!)
How ggplot2 works, or learning the basics of ggplot2: data and aesthetics and geometry, oh my!...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian