Licence Friction: A Tale of Two Datasets
It’s well-reported that data scientists and other users spend huge amounts of time cleaning and tidying data because it’s messy and non-standardised. It’s probably less well-reported how many great ideas are simply shelved for lack of access to data...Here’s a real-world example of where the lack of open geospatial data in the UK, and ongoing incompatibilities between data licences, are getting in the way of useful work...
Back of the Envelope Machine Learning
Data science projects fail, frequently. Between the end of 2017 and 2019 several published reports from Gartner, NewVantage, and VentureBeat AI showed that ‘failure’ rates on data science projects are north of 75%...A premortem is a thought exercise to predict or foresee why an analysis or project might fail. It forces the participants to change their perspective by porting their frame of reference to the hypothetical future after the project has failed...During a premortem for a proposed data science project, the participants could say...
Curriculum for Reinforcement Learning
A curriculum is an efficient tool for humans to progressively learn from simple concepts to hard problems. It breaks down complex knowledge by providing a sequence of learning steps of increasing difficulty. In this post, we will examine how the idea of curriculum can help reinforcement learning models learn to solve complicated tasks...
A Message From This Week's Sponsor
Watch Now: Hands-On Tutorial for Generative Adversarial Networks (GANs)
Are you interested in generating data to improve model accuracy? Are you concerned about training instability or failure-to-converge issues? Watch this recorded webinar about GANs, presented by the Principal Data Scientist for EMEA at Domino Data Lab.
This webinar covers the GAN framework, how to implement a basic GAN model, how adversarial networks are used to generate training samples, training difficulties, and recent research to improve upon GANs' training, including Wasserstein GAN (WGAN).
Data Science Articles & Videos
The Essential Machine Learning Project Checklist
This checklist was created to help ML students/practitioners structure their projects and problems in a way that makes sense to me...When I first started learning Python for machine learning and worked on my first few projects, I found it very overwhelming because...a) it was difficult to remember all of the steps I needed to take to make my data ML-friendly, b) I couldn't easily remember the functions, methods, and estimators from pandas, NumPy, and sklearn, and c) it was tedious and time-consuming to try to understand large (>50-feature) datasets...So, I created the ML checklist as a handy tool for whenever I start to feel lost creating an ML project...
Hyperparameter tuning with Keras Tuner
The success of a machine learning project is often crucially dependent on the choice of good hyperparameters. As machine learning continues to mature as a field, relying on trial and error to find good values for these parameters (also known as “grad student descent”) simply doesn’t scale. In fact, many of today’s state-of-the-art results, such as EfficientNet, were discovered via sophisticated hyperparameter optimization algorithms...Keras Tuner is an easy-to-use, distributable hyperparameter optimization framework that solves the pain points of performing a hyperparameter search...
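As a toy illustration of the kind of search loop a framework like Keras Tuner automates and distributes (this is not Keras Tuner's actual API; `validation_loss` is a synthetic stand-in for training and evaluating a model):

```python
import random

def validation_loss(lr, units):
    """Synthetic stand-in for training a model and returning its validation
    loss. In practice this would build, train, and evaluate a Keras model."""
    # This toy surface is lowest near lr=0.01 and units=128.
    return (lr - 0.01) ** 2 * 1e4 + ((units - 128) / 128) ** 2

def random_search(trials=50, seed=0):
    """Randomly sample hyperparameters and keep the best configuration --
    the basic strategy that tuning frameworks automate at scale."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, -1)          # log-uniform learning rate
        units = rng.choice([32, 64, 128, 256])  # discrete layer width
        loss = validation_loss(lr, units)
        if best is None or loss < best[0]:
            best = (loss, {"lr": lr, "units": units})
    return best

best_loss, best_params = random_search()
```

Keras Tuner wraps exactly this pattern behind a model-building function and adds smarter strategies (Bayesian optimization, Hyperband) plus distributed execution.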
Using artificial intelligence to enrich digital maps
A model invented by researchers at MIT and Qatar Computing Research Institute (QCRI) that uses satellite imagery to tag road features in digital maps could help improve GPS navigation...creating detailed maps is an expensive, time-consuming process done mostly by big companies...because this process is expensive, however, some parts of the world are ignored...the MIT and QCRI researchers describe “RoadTagger,” which uses a combination of neural network architectures to automatically predict the number of lanes and road types (residential or highway) behind obstructions...
Going deep on deep learning with Dr. Jianfeng Gao
Dr. Jianfeng Gao is a veteran computer scientist, an IEEE Fellow and the current head of the Deep Learning Group at Microsoft Research...Today, Dr. Gao gives us an overview of the deep learning landscape and talks about his latest work on Multi-task Deep Neural Networks, Unified Language Modeling and vision-language pre-training. He also unpacks the science behind task-oriented dialog systems as well as social chatbots like Microsoft Xiaoice, and gives us some great book recommendations along the way!...
Open-source library provides explanation for machine learning through diverse counterfactuals
Consider a person who applies for a loan with a financial company, but their application is rejected by a machine learning algorithm used to determine who receives a loan from the company. How would you explain the decision made by the algorithm to this person? One option is to provide them with a list of features that contributed to the algorithm’s decision, such as income and credit score. Many of the current explanation methods provide this information by either analyzing the algorithm’s properties or approximating it with a simpler, interpretable model...However, these explanations do not help this person decide what to do next to increase their chances of getting the loan in the future...
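A minimal sketch of the counterfactual idea, with a hypothetical `loan_model` standing in for the black-box algorithm: search for the smallest feature change that flips a rejection into an approval.

```python
def loan_model(income, credit_score):
    """Hypothetical black-box classifier: approve when a weighted score
    clears a threshold."""
    return 0.5 * (income / 1000) + 0.5 * (credit_score / 10) >= 60

def counterfactual(income, credit_score, income_step=1000, score_step=5):
    """Greedy search for the nearest feature change that flips a rejection
    to an approval -- the actionable explanation the post describes."""
    for steps in range(1, 50):
        # Try every split of `steps` increments between the two features.
        for d_inc in range(steps + 1):
            d_score = steps - d_inc
            if loan_model(income + d_inc * income_step,
                          credit_score + d_score * score_step):
                return {"income": income + d_inc * income_step,
                        "credit_score": credit_score + d_score * score_step}
    return None

cf = counterfactual(50000, 600)  # e.g. "raise income to 60,000"
```

Real libraries additionally search for *diverse* counterfactuals and respect feasibility constraints (e.g. age can't decrease), but the core question is the same: what minimal change would alter the decision?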
How Trip Inferences and Machine Learning Optimize Delivery Times on Uber Eats
Modeling the real world logistics that go into an Uber Eats trip is a complex problem. With incomplete information, small decisions can make or break the experience for our delivery-partners and eaters. One of the primary issues we try to optimize is when our business logic dispatches a delivery-partner to pick up an order. If the dispatch is too early, the delivery-partner waits while the food is being prepared. If the dispatch is too late, the food may not be as fresh as it could be, and it arrives late to the eater...we created our Uber Eats Trip State Model, letting us segment out each stage of a trip. Further, this model lets us collect and use historical data for individual restaurants so we can optimize delivery times for both our delivery-partners and eaters...
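The dispatch trade-off described above can be sketched in a few lines; the estimates here are hypothetical inputs, not Uber's actual model:

```python
def dispatch_time(order_placed_at, prep_minutes, travel_minutes):
    """Toy sketch of the dispatch trade-off: send the delivery-partner so
    they arrive roughly when the food is ready. Dispatching earlier means
    waiting at the restaurant; later means cold food and a late eater."""
    food_ready_at = order_placed_at + prep_minutes
    # Never dispatch before the order exists; otherwise aim for arrival
    # to coincide with the food being ready.
    return max(order_placed_at, food_ready_at - travel_minutes)
```

The hard part, which the trip state model addresses, is estimating `prep_minutes` per restaurant and per order from incomplete historical signals.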
Machine Unlearning
Once users have shared their data online, it is generally difficult for them to revoke access and ask for the data to be deleted. Machine learning (ML) exacerbates this problem because any model trained with said data may have memorized it, putting users at risk of a successful privacy attack exposing their information. Yet, having models unlearn is notoriously difficult. After a data point is removed from a training set, one often resorts to entirely retraining downstream models from scratch. We introduce SISA training, a framework that decreases the number of model parameters affected by an unlearning request and caches intermediate outputs of the training algorithm to limit the number of model updates that need to be computed to have these parameters unlearn...
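A toy sketch of the sharding idea (omitting SISA's slicing and caching of intermediate training states): train one constituent model per shard, aggregate their predictions, and on an unlearning request retrain only the shard that held the deleted point. The "models" here are mean-label stand-ins, not real learners.

```python
class ShardedEnsemble:
    """Sketch of SISA-style training: data is split into shards, one
    constituent model is trained per shard, and predictions are aggregated.
    Deleting a point then requires retraining only its shard."""

    def __init__(self, data, n_shards=3):
        self.shards = [data[i::n_shards] for i in range(n_shards)]
        self.models = [self._train(s) for s in self.shards]

    def _train(self, shard):
        # Stand-in "model": predicts the mean label of its shard.
        labels = [y for _, y in shard]
        return sum(labels) / len(labels) if labels else 0.0

    def predict(self):
        # Aggregate constituent predictions (here: a simple average).
        return sum(self.models) / len(self.models)

    def unlearn(self, point):
        # Remove the point and retrain only the shard that contained it.
        for i, shard in enumerate(self.shards):
            if point in shard:
                shard.remove(point)
                self.models[i] = self._train(shard)
                return i  # index of the single retrained shard
        return None

ens = ShardedEnsemble([(x, x % 2) for x in range(9)])
retrained = ens.unlearn((4, 0))  # only one of three shards is retrained
```

The cost of an unlearning request drops from "retrain everything" to "retrain one shard", at some cost in single-model accuracy, which the paper analyzes.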
Q-Learning in enormous action spaces via amortized approximate maximization
Applying Q-learning to high-dimensional or continuous action spaces can be difficult due to the required maximization over the set of possible actions. Motivated by techniques from amortized inference, we replace the expensive maximization over all actions with a maximization over a small subset of possible actions sampled from a learned proposal distribution. The resulting approach, which we dub Amortized Q-learning (AQL), is able to handle discrete, continuous, or hybrid action spaces while maintaining the benefits of Q-learning...
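A toy sketch of the substitution AQL makes: rather than maximizing Q over the full action space, maximize over a small set sampled from a proposal distribution. Both `q_value` and `proposal_sample` below are synthetic stand-ins, not the paper's learned networks.

```python
import random

def q_value(state, action):
    """Toy Q-function over a continuous action: peaks when the action
    matches the state value."""
    return -(action - state) ** 2

def proposal_sample(state, n, rng):
    """Stand-in for a learned proposal distribution over actions; here it
    simply samples actions near the state with Gaussian noise."""
    return [state + rng.gauss(0, 1.0) for _ in range(n)]

def amortized_max_q(state, n_samples=32, seed=0):
    """Approximate max_a Q(s, a) by maximizing over a small sampled subset
    of actions, instead of an intractable search over all actions."""
    rng = random.Random(seed)
    actions = proposal_sample(state, n_samples, rng)
    best_action = max(actions, key=lambda a: q_value(state, a))
    return best_action, q_value(state, best_action)

a, q = amortized_max_q(2.0)
```

As the proposal learns to put mass near high-value actions, a small sample suffices, which is what makes Q-learning tractable in enormous or continuous action spaces.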
OpenAI → PyTorch
We are standardizing OpenAI’s deep learning framework on PyTorch. In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models...Going forward we’ll primarily use PyTorch as our deep learning framework...
Want to post a job here? Email us for details >> firstname.lastname@example.org
- Senior Data Scientist - Fors Marsh Group (FMG) - Arlington VA
FMG is seeking an intelligent and motivated Senior Data Scientist to support data science and data analytic projects. As a part of our Advanced Analytics division, the Senior Data Scientist will have the opportunity to provide support on a variety of behavioral research and data science projects, including work for government and private sector clients. The Senior Data Scientist’s primary responsibility will be to provide subject matter expertise on the technical aspects of data science projects and design and execute plans for processing, management, and analysis of complex, multi-source, and big (high volume, velocity, and variety) data. This individual should bring expertise in data mining, statistical learning, supervised and unsupervised machine learning, applied research, and client management experience. This job is best for someone who enjoys solving challenging analytic problems, has experience extracting insights from large and complex data sets, and thrives in a collaborative environment...
Training & Resources
Thinc, a new deep learning library by the makers of spaCy and FastAPI
Thinc is a lightweight deep learning library that offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow or MXNet. You can use Thinc as an interface layer, a standalone toolkit or a flexible way to develop new models. Previous versions of Thinc have been running quietly in production in thousands of companies, via both spaCy and Prodigy. We wrote the new version to let users compose, configure and deploy custom models built with their favorite framework. The end result is a library quite different in its design, that’s easy to understand, plays well with others, and is a lot of fun to use...
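The composition idea at the heart of the library can be sketched in a few lines (this is not Thinc's actual API, which also handles backpropagation, type checking, and configuration):

```python
def chain(*layers):
    """Minimal sketch of function composition in the spirit of Thinc's
    combinator-style API: feed each layer's output into the next."""
    def composed(x):
        for layer in layers:
            x = layer(x)
        return x
    return composed

# Toy "layers" operating on lists of floats.
relu = lambda xs: [max(0.0, x) for x in xs]
scale = lambda xs: [2.0 * x for x in xs]

model = chain(relu, scale)
```

Thinc's real combinators compose trainable layers (including ones wrapping PyTorch or TensorFlow models) the same way, which is what lets it act as a glue layer between frameworks.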
A Gentle Introduction to Deep Learning for Graphs
This work is designed as a tutorial introduction to the field of deep learning for graphs. It favours a consistent and progressive introduction of the main concepts and architectural aspects over an exposition of the most recent literature, for which the reader is referred to available surveys. The paper takes a top-down view to the problem, introducing a generalized formulation of graph representation learning based on a local and iterative approach to structured information processing. It introduces the basic building blocks that can be combined to design novel and effective neural models for graphs. The methodological exposition is complemented by a discussion of interesting research challenges and applications in the field...
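The "local and iterative" view the tutorial formalizes boils down to repeated neighbor aggregation; a minimal sketch on a toy graph with scalar node features:

```python
def message_pass(features, adjacency):
    """One round of local aggregation: each node's new feature is the mean
    of its own feature and its neighbors' features. Stacking such rounds
    is the basic building block of neural models for graphs."""
    new = {}
    for node, neighbors in adjacency.items():
        vals = [features[node]] + [features[n] for n in neighbors]
        new[node] = sum(vals) / len(vals)
    return new

features = {"a": 1.0, "b": 3.0, "c": 5.0}
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
out = message_pass(features, adjacency)
```

Real graph networks replace the mean with learned, parameterized aggregation and transformation functions, but the local-neighborhood structure is the same.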
Data Science in Production: Building Scalable Model Pipelines with Python
This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and scikit-learn, and will focus on scaling up prototype models to production...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page
P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian