Data Science Weekly Newsletter

Issue 434

March 17, 2022

Editor's Picks

A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings
This is a deep dive: over 8,000 words long. Don’t be afraid to bookmark this article and read it in pieces. There is a lot to cover. We will start with basic One-Hot encoding, move on to word2vec word and sentence embeddings, build our own custom embeddings using R, and finally, work with the cutting-edge BERT model and its contextual embeddings...
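As a rough illustration of the starting point named above, here is a minimal one-hot encoding sketch. It is written in Python rather than the R used in the article, and the function name and toy sentence are hypothetical, not taken from the article.

```python
def one_hot_encode(tokens):
    """Return (vocab, vectors): one 0/1 vector per token, in input order."""
    vocab = sorted(set(tokens))                       # fixed, reproducible vocabulary order
    index = {tok: i for i, tok in enumerate(vocab)}   # token -> position in the vector
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        vec[index[tok]] = 1                           # exactly one "hot" position per token
        vectors.append(vec)
    return vocab, vectors


# Toy sentence (an assumption for illustration only).
vocab, vectors = one_hot_encode("the cat sat on the mat".split())
print(vocab)
print(vectors[0])
```

Every vector is as long as the vocabulary and carries no notion of similarity between words, which is the limitation that dense embeddings such as word2vec are meant to address.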

Making Deep Learning Go Brrrr From First Principles
So, you want to improve the performance of your deep learning model. How might you approach such a task? Often, folk fall back to a grab-bag of tricks that might've worked before or that they saw on a tweet. "Use in-place operations! Set gradients to None! Install PyTorch 1.10.0 but not 1.10.1!"...It's understandable why users often take such an ad-hoc approach: performance on modern systems (particularly deep learning) often feels as much like alchemy as it does science. That being said, reasoning from first principles can still eliminate broad swathes of approaches, thus making the problem much more approachable...So, if you want to keep your GPUs going brrrr, let's discuss the three components your system might be spending time on - compute, memory bandwidth, and overhead...

Announcing the 2022 AI Index Report
The AI Index is an independent initiative at the Stanford Institute for Human-Centered Artificial Intelligence (HAI), led by the AI Index Steering Committee, an interdisciplinary group of experts from across academia and industry. The annual report tracks, collates, distills, and visualizes data relating to artificial intelligence, enabling decision-makers to take meaningful action to advance AI responsibly and ethically with humans in mind...

A Message From This Week's Sponsor

Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.

Data Science Articles & Videos

The “0 / 1 / Done” Strategy for Data Science
To achieve operational excellence in applied data science delivery, aim for 0-day Handovers, 1-day Prototyping, and declaring projects Done only when Completely Done. Work backwards from these goals, conducting a gap analysis between them and current team capabilities, to identify the process, tooling and governance initiatives needed to reach them...

Data salaries at FAANG companies in 2022
What 4000 data points can tell us about the state of data salaries at top tech companies...

What Good Data Product Managers Do – And Why You Probably Need One
We are seeing data departments modernize their team structure with data product managers at the helm of such projects...In this article, we’ll walk through: a) What is a data product manager? How did the role evolve?, b) What is a data product?, c) What does a data product manager do? What skills do they need?, d) What background do product managers need? Who do they report to?, e) Data product manager vs product manager, f) Data product manager vs data scientist, and g) The future of the data product manager...

How to Build Effective (and Useful) Dashboards
With practice I have been developing a four-step approach (that I am still fine-tuning) to build dashboards that are, first of all, effective, but also useful...In this article I want to take you through these four steps. Whether you are an experienced analyst building data visualizations all day long or a business user using dashboards from time to time, I hope you will find these guidelines useful...

Future ML Systems Will Be Qualitatively Different
In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena...In this post, I'll argue that emergence often occurs in the field of AI, and that this should significantly affect our intuitions about the long-term development and deployment of AI systems. We should expect weird and surprising phenomena to emerge as we scale up systems. This presents opportunities, but also poses important risks...

Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization
Our recent #ICLR2022 work develops an efficient Human-in-the-loop learning method called HACO (Human-AI CoPilot), to mentor agents interactively to act and drive while preserving their own curiosity and exploration...

Building systems to securely reason over private data
People today rely on AI systems such as assistants and chatbots to help with countless tasks...For systems to execute these tasks, users must provide them with relevant information — such as one’s location or work calendar. In some cases, however, people would prefer to keep information private, which means not uploading it to cloud-based AI systems or sharing it with others...Meta AI is releasing ConcurrentQA, the first public data set for studying information retrieval and question answering (QA) with data from multiple privacy scopes. Alongside the data set and problem exploration, we have developed a new methodology as a starting point for thinking about privacy in retrieval-based settings called Public-Private Autoregressive Information Retrieval (PAIR)...

Generative Flow Networks
I [Yoshua Bengio] have rarely been as enthusiastic about a new research direction. We call them GFlowNets, for Generative Flow Networks. They live somewhere at the intersection of reinforcement learning, deep generative models and energy-based probabilistic modelling...What I find exciting is that they open so many doors, but in particular for implementing the system 2 inductive biases I have been discussing in many of my papers and talks since 2017, that I argue are important to incorporate causality and deal with out-of-distribution generalization in a rational way...

The promise of AI with Demis Hassabis
Last episode of Season 2 of DeepMind: The Podcast...Hannah wraps up the series by meeting DeepMind co-founder and CEO, Demis Hassabis. In an extended interview, Demis describes why he believes AGI is possible, how we can get there, and the problems he hopes it will solve. Along the way, he highlights the important role of consciousness and why he’s so optimistic that AI can help solve many of the world’s major challenges. As a final note, Demis shares the story of a personal meeting with Stephen Hawking to discuss the future of AI and discloses Hawking’s parting message...

Organizing and scaling an effective data team
Here’s a high-level sketch of how to organize and scale an effective data team...

Winning at Competitive ML in 2022
Hoping to win a machine learning competition in 2022? Here’s what you need to know. I collaborated with ML Contests, using their database of over 80 competitions that took place in 2021 across Kaggle, DrivenData, AICrowd, Zindi, and 13 other platforms. Wherever the information was available, we categorized winners to figure out what made them win...

Summit

You're invited to the first-ever Metrics Store Summit
Transform is hosting the first-ever industry summit on the metrics layer. The Metrics Store Summit on April 26, 2022 will bring discussions around the semantic layer into one event, providing context with use cases for metrics stores, highlighting applications for metrics, and sharing ideas from leaders across the modern data stack. You can expect to hear from Airbnb, Slack, Spotify, Atlan, Hex, Mode, Hightouch, AtScale and many more in this action-packed 1-day event. We would love to see you there! Register today for free.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Sr. Machine Learning Engineer - eBay - NYC
The eBay Ads team focuses on building data / ML services for our advertiser sellers to guide them to optimize for their Ads budget and goals, for example by recommending the right inventory, keywords and Ads bid rate to apply for their campaigns, and eventually create campaigns automatically for advertisers. This is a relatively new area but with a very high business potential and need. It would allow you to work with massive amounts of data and use a variety of data science techniques.

We are looking for someone passionate about deploying reliable and efficient ML services to a production environment at scale. You will collaborate with researchers to invent, design and implement end-to-end production services in Python and Java/Scala using state-of-the-art big data and ML tools. Come and help us blow away the boundaries of e-commerce through AI!

If you are interested in applying, please contact the hiring manager, Dan Schonfeld, at dschonfeld@ebay.com.

Training & Resources

Efficient Transformers: A Survey
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning...Recently, a dizzying number of "X-former" models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency. With the aim of helping the avid researcher navigate this flurry, this paper characterizes a large and thoughtful selection of recent efficiency-flavored "X-former" models, providing an organized and comprehensive overview of existing work and models across multiple domains...

GFlowNet Tutorial
A GFlowNet is a trained stochastic policy or generative model, trained such that it samples objects $x$ through a sequence of constructive steps, with probability proportional to $R(x)$, where $R$ is some given non-negative integrable reward function...
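To make "with probability proportional to $R(x)$" concrete: on a finite set of objects the target distribution is $\pi(x) = R(x) / \sum_{x'} R(x')$. The sketch below computes that target by brute force on a tiny, hypothetical object space. It only illustrates the distribution a GFlowNet is trained to match; it is not a GFlowNet and is not taken from the tutorial.

```python
def target_distribution(rewards):
    """Map each object x to R(x) / Z, where Z is the sum of all rewards."""
    if any(r < 0 for r in rewards.values()):
        raise ValueError("R must be non-negative")
    total = sum(rewards.values())          # normalizing constant Z
    if total <= 0:
        raise ValueError("at least one reward must be positive")
    return {x: r / total for x, r in rewards.items()}


# Hypothetical rewards for three objects (assumed values, for illustration only).
toy_rewards = {"a": 1.0, "b": 3.0, "c": 0.0}
probs = target_distribution(toy_rewards)
print(probs)
```

Enumerating and normalizing like this is only feasible for tiny spaces, which is why a learned sequential sampler is of interest when the set of objects is too large to list.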

The Bayesian Learning Rule
We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
