Data Science Weekly Newsletter
December 3, 2020
Specification gaming: the flip side of AI ingenuity
Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name...This problem also arises in the design of artificial agents. For example, a reinforcement learning agent can find a shortcut to getting lots of reward without completing the task as intended by the human designer...In this post, we review possible causes for specification gaming, share examples of where this happens in practice, and argue for further work on principled approaches to overcoming specification problems.
A Scalable Approach to Reducing Gender Bias in Google Translate
Machine learning (ML) models for language translation can be skewed by societal biases reflected in their training data. One such example, gender bias, often becomes more apparent when translating between a gender-specific language and one that is less so. For instance, Google Translate historically translated the Turkish equivalent of “He/she is a doctor” into the masculine form, and the Turkish equivalent of “He/she is a nurse” into the feminine form...
Vettery is an online hiring marketplace that's changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today.
Data Science Articles & Videos
A Graph Convolutional Neural Network Approach to Antibiotic Discovery
In an age when bacterial infections are developing resistance to common antibiotics, the discovery of a new and potentially powerful antibiotic is news in itself. But what makes a recent breakthrough truly revolutionary is that the promising molecule—called halicin—was discovered using deep learning....
Andrej Karpathy speaking about Tesla Autopilot at ScaledML2020 [Video]
Andrej Karpathy gave a talk at the ScaledML 2020 event in February about Tesla's approach to training neural networks for self-driving...He details how their team is tackling the long tail of corner cases, taking advantage of dynamic data collection on a fleet of a million cars with Bayesian and active learning methods...
Vector-Quantized Contrastive Predictive Coding for Template-based Music Generation
In this paper, we proposed a flexible method for generating variations of discrete sequences in which tokens can be grouped into basic units, like sentences in a text or bars in music. More precisely, given a template sequence, we aim at producing novel sequences sharing perceptible similarities with the original template without relying on any annotation. The novelty of our approach is to cast the problem of generating variations as a representation learning problem...
Things I Wished More Developers Knew About Databases
In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though it is impossible to ignore how databases work, the problems that application developers foresee and experience will often be just the tip of the iceberg. In this series, I’m sharing a few insights I specifically found useful for developers who are not specialized in this domain...
Building an end-to-end Speech Recognition model in PyTorch
Let's walk through how one would build their own end-to-end speech recognition model in PyTorch. The model we'll build is inspired by Deep Speech 2 (Baidu's second revision of their now-famous model) with some personal improvements to the architecture. The output of the model will be a probability matrix of characters, and we'll use that probability matrix to decode the most likely characters spoken from the audio....
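To make the "probability matrix of characters" idea concrete, here is a minimal sketch of greedy decoding in plain Python. It is not the linked article's code: the function name `greedy_decode`, the toy alphabet, and the use of a CTC-style blank symbol at index 0 are all assumptions made for illustration.

```python
def greedy_decode(prob_matrix, alphabet, blank_index=0):
    """Greedy decoding of a (time steps x characters) probability matrix.

    Assumption (not from the article): CTC-style output, where one index is
    a "blank" symbol and consecutive repeats of a symbol are collapsed.
    """
    decoded = []
    previous = None
    for frame in prob_matrix:
        # Index of the most likely character at this time step.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Keep the character only if it is not a blank and not a repeat
        # of the previous time step's prediction.
        if best != previous and best != blank_index:
            decoded.append(alphabet[best])
        previous = best
    return "".join(decoded)


# Hypothetical toy alphabet; index 0 is the blank symbol.
alphabet = ["_", "a", "c", "t"]

# Each row is one time step; each column is the probability of one character.
probs = [
    [0.1, 0.1, 0.7, 0.1],  # "c"
    [0.1, 0.1, 0.7, 0.1],  # "c" again (collapsed as a repeat)
    [0.6, 0.2, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # "a"
    [0.1, 0.1, 0.1, 0.7],  # "t"
]

print(greedy_decode(probs, alphabet))  # prints "cat"
```

Greedy decoding only keeps the single best character per time step; real systems often swap in beam search, sometimes with a language model, for better transcripts.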
Lip Reading - Cross Audio-Visual Recognition using 3D Architectures
Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multispeaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this paper. We propose the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio-visual streams using the learned multimodal features....
The Cost of Training NLP Models: A Concise Overview
We review the cost of training large-scale language models, and the drivers of these costs. The intended audience includes engineers and scientists budgeting their model-training experiments, as well as non-practitioners trying to make sense of the economics of modern-day Natural Language Processing (NLP)...
LA Traffic Data Analysis
As a 7-year Los Angeles resident, I’ve sat in more than my fair share of gridlock, seemingly regardless of time of day or day of week. That’s why I was so interested when I stumbled upon a traffic collision data set maintained by the city of Los Angeles...After browsing the data, I settled on 3 major questions I wanted to attempt to answer: 1) How do traffic collision patterns vary by time of day, day of week, and time of year?, 2) How are collisions distributed geographically? Is it possible to identify high-risk areas or intersections?, 3) Is it possible to predict the number of collisions in a given time frame?...
After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:
Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)
Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate
Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
The Amazon Demand Forecasting team seeks a Data Scientist with strong analytical and communication skills to join our team. We develop sophisticated algorithms that involve learning from large amounts of data, such as prices, promotions, similar products, and a product's attributes, in order to forecast the demand of over 190 million products world-wide. These forecasts are used to automatically order more than $200 million worth of inventory weekly, establish labor plans for tens of thousands of employees, and predict the company's financial performance. The work is complex and important to Amazon. With better forecasts we drive down supply chain costs, enabling the offer of lower prices and better in-stock selection for our customers...
Want to post a job here? Email us for details >> email@example.com
Training & Resources
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
Coding habits for data scientists
[ Article link ] [ YouTube Video Series Link ]
If you’ve tried your hand at machine learning or data science, you know that code can get messy, quickly...Typically, code to train ML models is written in Jupyter notebooks and it’s full of (i) side effects (e.g. print statements, pretty-printed dataframes, data visualisations) and (ii) glue code without any abstraction, modularisation and automated tests. While this may be fine for notebooks targeted at teaching people about the machine learning process, in real projects it’s a recipe for unmaintainable mess. The lack of good coding habits makes code hard to understand and consequently, modifying code becomes painful and error-prone. This makes it increasingly difficult for data scientists and developers to evolve their ML solutions...In this article, we’ll share techniques for identifying bad habits that add to complexity in code as well as habits that can help us partition complexity....
A Visual Exploration of DeepCluster
Many self-supervised methods use pretext tasks to generate surrogate labels and formulate an unsupervised learning problem as a supervised one. Some examples include rotation prediction, image colorization, jigsaw puzzles etc. However, such pretext tasks are domain-dependent and require expertise to design them...DeepCluster is a self-supervised method proposed by Caron et al. of Facebook AI Research that brings a different approach. This method doesn’t require domain-specific knowledge and can be used to learn deep representations for scenarios where annotated data is scarce...
Data Science in Production: Building Scalable Model Pipelines with Python
This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and scikit-learn, and will focus on scaling up prototype models to production....
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.
P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
All the best,
Hannah & Sebastian