Data Science Weekly Newsletter
October 28, 2021
Reimagining Philippine mythical creatures using VQGAN+CLIP
Most of what I know about Philippine folklore came from stories that were passed down from one generation to the next. I knew that a kapre is a large creature smoking a cigar because the “auntie of my mother’s friend” said so...If we provide a machine learning model with text descriptions of folk creatures, what images can it conjure?...I used a neural network called VQGAN+CLIP...and supplied it with descriptions of Philippine folk creatures. The resulting images can then be thought of as what the model “imagined” upon reading them...and they’re a bit surreal and creepy!...
Dirty Data Science: Machine Learning on non-curated data
These slides are a one-hour course on machine learning with non-curated data...According to industry surveys, the number one hassle of data scientists is cleaning the data to analyze it. Here, I survey the kinds of "dirtiness" that force time-consuming cleaning. We will then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning...
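To make the "adapting methods from natural language processing" idea concrete, here is a minimal sketch (my own illustration, not taken from the slides) of similarity encoding for non-normalized category strings: represent each raw entry by its character 3-gram similarity to a set of clean reference categories, so near-duplicate spellings land close together in feature space.

```python
# Toy similarity encoding for "dirty" categorical entries. Misspelled
# variants of a category share most of their character 3-grams with the
# clean spelling, so their feature vectors end up similar -- no manual
# deduplication of the column needed.

def ngrams(s, n=3):
    """Set of character n-grams of s, padded with spaces at the ends."""
    s = f" {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(value, references):
    """Encode a raw entry as its similarities to clean reference categories."""
    return [similarity(value, ref) for ref in references]

references = ["accountant", "engineer", "teacher"]
clean = similarity_encode("accountant", references)
typo = similarity_encode("acountant", references)  # misspelled variant
```

The misspelled entry still scores highest against "accountant", so a downstream model treats the two spellings as nearly the same category.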
Just Ask for Generalization
This blog post outlines a key engineering principle I’ve come to believe strongly in for building general AI systems with deep learning. This principle guides my present-day research tastes and day-to-day design choices in building large-scale, general-purpose ML systems...Generalizing to what you want may be easier than optimizing directly for what you want...
A Message From This Week's Sponsor
Quit writing SQL. Find answers faster.
Tired of building dashboards and writing SQL queries for colleagues? PostHog enables teams to get answers by themselves quickly and easily, without needing to write any code.
And it can be deployed on your own infrastructure, which is nice.
PostHog offers everything product-led teams need to grow, including funnel analysis, session recordings and feature flags — all in one platform, all without SQL.
Deploy PostHog today for free.
Data Science Articles & Videos
A Behind-the-Scenes Look at How Postman’s Data Team Works
Postman is no stranger to scale. What started out as a side project six years ago is now one of India’s latest unicorns with a $5.6 billion valuation...Through a series of conversations with Prudhvi Vasa, Postman’s Analytics Leader, I’ve written this article to dive into a behind-the-scenes view of Postman’s data team — how it’s structured, who they hire for different roles, how they plan and prioritize their work democratically, and how they use sprints to constantly identify problems and make improvements...
Machine learning is just statistics + quantifier reversal
In a recent blog post titled “Machine learning is not nonparametric statistics”, Ben Recht speaks to some of the difficulties in applying classical statistical tools to understand why machine learning works. A core piece of his argument goes as follows: Given a fixed classifier, we can assess the classifier’s population error rate using a sample of data and basic statistics. But in machine learning there’s a switcheroo—we select the sample of data first, and then we use that data to select the classifier. This means those classical statistical tools don’t work anymore. What gives?...It turns out that back in 1998, David McAllester worked out an elegant way to deal with this switcheroo that he called quantifier reversal. By applying quantifier reversal, those classical statistical tools become useful again. So what is quantifier reversal? How can it make me a million dollars? And what can it tell me about why machine learning works? That’s what I’m going to answer in this post!...
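For the flavor of the argument, here is the simplest instance of getting the quantifiers in the right order — a finite classifier class handled with Hoeffding plus a union bound. This is not McAllester's quantifier reversal lemma itself (which handles far richer classes via PAC-Bayes machinery), just the baseline case it generalizes.

```latex
% Fixed classifier h, random sample S: classical statistics applies directly.
\Pr_{S \sim D^n}\!\left[\, \lvert \mathrm{err}(h) - \widehat{\mathrm{err}}_S(h) \rvert > \epsilon \,\right] \le 2 e^{-2n\epsilon^2}

% The switcheroo: the learned h depends on S. For a finite class H, a union
% bound moves "for all h" inside the probability, so the guarantee holds
% simultaneously for every classifier we might select from the data:
\Pr_{S \sim D^n}\!\left[\, \forall h \in H:\ \lvert \mathrm{err}(h) - \widehat{\mathrm{err}}_S(h) \rvert \le \sqrt{\tfrac{\ln(2\lvert H \rvert/\delta)}{2n}} \,\right] \ge 1 - \delta
```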
Apple: On-device Panoptic Segmentation for Camera Using Transformers
The Apple Camera App (in iOS and iPadOS) relies on a wide range of scene-understanding technologies to develop images. In particular, pixel-level understanding of image content, also known as image segmentation, is behind many of the app's front-and-center features...Panoptic segmentation unifies scene-level and subject-level understanding by predicting two attributes for each pixel: a categorical label and a subject label...In this post, we walk through the technical details of how we designed a neural architecture for panoptic segmentation, based on Transformers, that is accurate enough to use in the camera pipeline but compact and efficient enough to execute on-device with negligible impact on battery life...
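A toy illustration (my own, not Apple's pipeline) of what a panoptic prediction contains: every pixel gets both a categorical label and a subject (instance) label. A common convention in panoptic datasets is to pack the pair into a single per-pixel id as `category * 1000 + instance`; the constant 1000 here is just an assumed instance-count bound.

```python
# Pack per-pixel (category, instance) pairs into single panoptic ids,
# assuming fewer than 1000 instances per image.

def panoptic_id(category, instance):
    return category * 1000 + instance

def unpack(pid):
    """Recover (category, instance) from a packed panoptic id."""
    return divmod(pid, 1000)

# 2x3 image: semantic map (0 = sky, 1 = person) and instance map.
categories = [[0, 0, 1],
              [1, 1, 1]]
instances  = [[0, 0, 1],
              [1, 1, 2]]  # two distinct people share category 1

panoptic = [[panoptic_id(c, i) for c, i in zip(crow, irow)]
            for crow, irow in zip(categories, instances)]
```

The two people have the same categorical label but different panoptic ids, which is exactly the scene-level plus subject-level distinction the post describes.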
Parameter Prediction for Unseen Deep Architectures
The algorithms optimizing neural network parameters remain largely hand-designed and computationally inefficient. We study if we can use deep learning to directly predict these parameters by exploiting the past knowledge of training other networks. We introduce a large-scale dataset of diverse computational graphs of neural architectures - DeepNets-1M - and use it to explore parameter prediction on CIFAR-10 and ImageNet. By leveraging advances in graph neural networks, we propose a hypernetwork that can predict performant parameters in a single forward pass taking a fraction of a second, even on a CPU. The proposed model achieves surprisingly good performance on unseen and diverse networks...
How to deploy machine learning with differential privacy?
In many applications of machine learning, such as machine learning for medical diagnosis, we would like to have machine learning algorithms that do not memorize sensitive information about the training set, such as the specific medical histories of individual patients. Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data...
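As a minimal sketch of the core idea (the classic Laplace mechanism, not the private deep-learning training the post covers): answer a counting query with noise calibrated to the query's sensitivity, so any single patient's presence or absence changes the output distribution by at most a factor of e^epsilon.

```python
# Epsilon-differentially private counting query. A count has sensitivity 1
# (adding or removing one person changes it by at most 1), so Laplace noise
# with scale 1/epsilon suffices.
import random

def laplace_count(data, predicate, epsilon):
    """Return the count of matching rows plus Laplace(1/epsilon) noise."""
    true_count = sum(1 for row in data if predicate(row))
    # A Laplace(scale = 1/epsilon) sample is the difference of two
    # exponential samples with rate epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Smaller epsilon means more noise: stronger privacy, lower accuracy.
patients = [{"condition": "flu"}, {"condition": "flu"}, {"condition": "cold"}]
private_count = laplace_count(patients, lambda p: p["condition"] == "flu",
                              epsilon=0.5)
```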
Applications and Techniques for Fast Machine Learning in Science
We discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material...covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions...
Modern Data Stack Conference (MDSCON) 2021: The Top 5 Takeaways You Should Know
A few weeks ago, Fivetran hosted the Modern Data Stack Conference (MDSCON) 2021, a virtual conference to empower data-driven decisions that transform businesses, teams, and careers...For those who missed the conference, and for those who were there but couldn’t attend every session, here are five key ideas and takeaways from MDSCON 2021...
The Future of the Data Engineer
Maxime Beauchemin, one of the first data engineers at Facebook and Airbnb, wrote and open-sourced the wildly popular orchestrator, Apache Airflow, followed shortly thereafter by Apache Superset, a data exploration tool that’s taking the data viz landscape by storm...he also wrote the landmark 2017 blog post, The Rise of the Data Engineer...So, five years later, where do we stand?...I sat down with Maxime to discuss the current state of affairs, including the decentralization of the modern data stack, the fragmentation of the data team, the rise of the cloud, and how all these factors have changed the role of the data engineer forever...
A First-Principles Theory of Neural Network Generalization
Deep learning has proven a stunning success for countless problems of interest, but this success belies the fact that, at a fundamental level, we do not understand why it works so well...Perhaps the greatest of these mysteries has been the question of generalization: why do the functions learned by neural networks generalize so well to unseen data?...in our recent paper, we derive a first-principles theory that allows one to make accurate predictions of neural network generalization (at least in certain settings)...
Declutter and Focus: Empirically Evaluating Design Guidelines for Effective Data Communication
The visualization practitioner community prescribes two popular guidelines for creating clear and efficient visualizations: declutter and focus. The declutter guidelines suggest removing non-critical gridlines, excessive labeling of data values, and color variability to improve aesthetics and to maximize the emphasis on the data relative to the design itself. The focus guidelines for explanatory communication recommend including a clear headline that describes the relevant data pattern, highlighting a subset of relevant data values with a unique color, and connecting those values to written annotations that contextualize them in a broader argument. We evaluated how these recommendations impact recall of the depicted information across cluttered, decluttered, and decluttered+focused designs of six graph topics...
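The two guidelines translate directly into matplotlib calls. A minimal sketch (my own example with made-up numbers, not the paper's stimuli): declutter by dropping gridlines and box lines, then focus by stating the pattern in the headline and giving one series a unique color with an annotation.

```python
# Declutter + focus on a small line chart. All data values are invented
# for illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

years = [2017, 2018, 2019, 2020, 2021]
ours = [12, 15, 21, 30, 44]
rivals = [[10, 11, 12, 14, 15], [9, 10, 12, 11, 13]]

fig, ax = plt.subplots()
# Declutter: no gridlines, no top/right box lines.
ax.grid(False)
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
# Focus: headline describes the relevant data pattern...
ax.set_title("Our product's usage tripled while competitors stayed flat")
# ...context series in muted gray, the highlighted subset in a unique color.
for r in rivals:
    ax.plot(years, r, color="lightgray")
ax.plot(years, ours, color="tab:blue", linewidth=2.5)
ax.annotate("3.7x growth since 2017", xy=(2021, 44), xytext=(2018.2, 40))
fig.savefig("decluttered.png")
```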
Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data—whether it's a simple read-write visualization or a full-fledged ML workflow.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Reddit Discussion: A Guide to Tesla’s Configurable Floating Point Formats & Arithmetic
Tesla just randomly dropped a PDF with details of the custom floating point formats they've created for their Dojo training hardware...I think it's pretty interesting. They want to eliminate 32-bit floating point from training almost entirely, using custom 16-bit and even 8-bit floating point formats instead, with a configurable "exponent bias" that is shared between many numbers and can apparently be learned during training. Also, they have stochastic rounding, which seems like a great idea for low-precision formats. Worth a glance if you care about hardware...
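Why stochastic rounding matters for low-precision formats can be shown in a few lines. This is a toy sketch of the rounding rule itself (not Tesla's hardware or bit formats): round to a coarse grid up or down with probability proportional to proximity, so the result is unbiased in expectation and tiny updates are not silently dropped the way nearest-rounding would drop them.

```python
# Stochastic rounding to a fixed grid: unbiased in expectation, unlike
# round-to-nearest, which would deterministically discard any update
# smaller than half a grid step.
import random

def stochastic_round(x, step):
    """Round x to a multiple of `step`: up with probability equal to the
    fractional position between grid points, down otherwise."""
    q, frac = divmod(x / step, 1.0)
    return (q + (1 if random.random() < frac else 0)) * step

# A +0.1 update on a 0.25 grid: nearest-rounding always yields 0.0, losing
# the update entirely; stochastic rounding yields 0.25 about 40% of the
# time, so the updates survive on average.
```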