Data Science Weekly Newsletter

Issue 396

June 24, 2021

Editor's Picks

Data Organization in Spreadsheets
Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files...
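
Two of the recommendations above (dates written as YYYY-MM-DD, no empty cells) are easy to check automatically once the data is saved as plain text. The following is a minimal sketch, not code from the article: the function name `check_rows` and the assumption that the date column is literally named `date` are ours, purely for illustration.

```python
import csv
import io
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_rows(csv_text, date_column="date"):
    """Return (row_number, column, problem) tuples for a CSV export.

    Hypothetical helper: flags empty cells and dates that are not
    written as YYYY-MM-DD. Row 1 is assumed to be the single header row.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None:
                # More cells in this row than there are header columns.
                problems.append((row_number, "(extra)", "cell without a header"))
            elif value is None or value.strip() == "":
                problems.append((row_number, column, "empty cell"))
            elif column == date_column:
                try:
                    # The regex enforces zero padding; strptime rejects
                    # impossible dates such as 2020-13-40.
                    if not ISO_DATE.fullmatch(value):
                        raise ValueError(value)
                    datetime.strptime(value, "%Y-%m-%d")
                except ValueError:
                    problems.append((row_number, column, "date not YYYY-MM-DD"))
    return problems
```

Running it on a small export returns one tuple per problem cell, which can be reviewed before any analysis starts.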

WiDS Houston 2020
Women in Data Science Houston 2020 took place virtually on Friday, Oct 23, 2020 from 1-4:30pm Central. These are recordings of talks during the event...

'It's the screams of the damned!' The eerie AI world of deepfake music
Artificial intelligence is being used to create new songs seemingly performed by Frank Sinatra and other dead stars. ‘Deepfakes’ are cute tricks – but they could change pop for ever...The song in question is not a genuine track, but a convincing fake created by “research and deployment company” OpenAI, whose Jukebox project uses artificial intelligence to generate music, complete with lyrics, in a variety of genres and artist styles...Legal departments in the music industry are following developments closely. Earlier this year, Roc Nation filed DMCA takedown requests against an anonymous YouTube user for using AI to mimic Jay-Z’s voice and cadence to rap Shakespeare and Billy Joel. (Both are incredibly realistic.)...

A Message From This Week's Sponsor

Proactively monitoring your AI performance with Mona

Mona is a SaaS platform that enables teams to proactively monitor data and model performance in production for biases, concept drifts, and data integrity issues. Mona takes a platform approach to monitoring, placing the data scientist/ML engineer in control with flexible configuration to ensure each team's unique monitoring needs are met. Mona can be deployed in 2 hours or less on any tech stack and in any ML use-case.

Data Science Articles & Videos

Self-training and Pre-training are Complementary for Speech Recognition
Bootstrap Your Own Latent (BYOL) is a self-supervised learning approach for image representation. From an augmented view of an image, BYOL trains an online network to predict a target network representation of a different augmented view of the same image. Unlike contrastive methods, BYOL does not explicitly use a repulsion term built from negative pairs in its training objective. Yet, it avoids collapse to a trivial, constant representation. Thus, it has recently been hypothesized that batch normalization (BN) is critical to prevent collapse in BYOL...we experimentally show that replacing BN with a batch-independent normalization scheme...achieves performance comparable to vanilla BYOL...Our finding disproves the hypothesis that the use of batch statistics is a crucial ingredient for BYOL to learn useful representations...

Fight San Francisco Crime with fast.ai and Deepnote
When most people picture San Francisco and the Bay Area, various positive associations such as the Golden Gate Bridge, Chinatown, and software companies come to mind. However, as in any metropolitan area, its dense population and wealth gap lead to a high crime rate. Thanks to initiatives such as SF OpenData and Kaggle’s San Francisco Crime Classification competition, data compiled from all of this crime can be leveraged to better handle and respond to it...In particular, this article will focus on how Deepnote’s Jupyter-backed notebook environment and fast.ai’s effective encapsulation of machine learning data preparation greatly improve data scientist efficiency...

Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering
Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which is not readily available in practice...In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties...

Sensors driven by machine learning sniff out gas leaks fast
A new study confirms the success of a natural-gas leak-detection tool that uses sensors and machine learning to locate leak points at oil and gas fields, promising new automatic, affordable sampling across vast natural gas infrastructure...

Graph Kernels: State-of-the-Art and Future Challenges
Graph-structured data are an integral part of many application domains, including chemoinformatics, computational biology, neuroimaging, and social network analysis. Over the last two decades, numerous graph kernels, i.e. kernel functions between graphs, have been proposed to solve the problem of assessing the similarity between graphs, thereby making it possible to perform predictions in both classification and regression settings. This manuscript provides a review of existing graph kernels, their applications, software plus data resources, and an empirical comparison of state-of-the-art graph kernels...

Preprocessing for deep learning: from covariance matrix to image whitening
The goal of this post/notebook is to go from the basics of data preprocessing to modern techniques used in deep learning. My point is that we can use code (Python/NumPy etc.) to better understand abstract mathematical notions...Here is the syllabus of this tutorial: 1) Background: In the first part, we will get some reminders about variance and covariance and see how to generate and plot fake data to get a better understanding of these concepts...2) Preprocessing: In the second part, we will see the basics of some preprocessing techniques that can be applied to any kind of data: mean normalization, standardisation and whitening...3) Whitening images: In the third part, we will use the tools and concepts gained in 1. and 2. to do a special kind of whitening called Zero Component Analysis (ZCA). It can be used to preprocess images for deep learning...
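
For readers who want a feel for the third part before opening the notebook, here is a minimal NumPy sketch of ZCA whitening. It is our own illustration rather than the tutorial's code: the function name `zca_whiten` and the small `eps` regularizer are assumptions, and rows are taken to be samples with columns as features.

```python
import numpy as np


def zca_whiten(X, eps=1e-5):
    """Whiten X (n_samples x n_features) with the ZCA transform.

    Sketch only: `eps` is an assumed regularizer that avoids dividing
    by near-zero eigenvalues.
    """
    # Mean normalization: center every feature at zero.
    X_centered = X - X.mean(axis=0)
    # Feature covariance matrix (n_features x n_features).
    cov = np.cov(X_centered, rowvar=False)
    # The covariance is symmetric, so eigh gives real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis, rescale to unit variance, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_centered @ W
```

After the transform the features are decorrelated with roughly unit variance. Rotating back into the original axes (the final `eigvecs.T`) is what distinguishes ZCA from plain PCA whitening and keeps whitened images visually close to the originals.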

Vision Transformer
In this repository we release models from the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale that were pre-trained on the ImageNet-21k (imagenet21k) dataset. We provide the code for fine-tuning the released models in Jax/Flax...

Testing Natural Language Models
In this episode of the Data Exchange [Podcast] I speak with Marco Ribeiro, Senior Researcher at Microsoft Research, and lead author of the award-winning paper “Beyond Accuracy: Behavioral Testing of NLP models with CheckList”. As machine learning gains importance across many application domains and industries, there is a growing need to formalize how ML models get built, deployed, and used. MLOps is an emerging set of practices focused on productionizing the machine learning lifecycle that draws ideas from CI/CD. But even before we talk about deploying a model to production, how do we inject more rigor into the model development process?...

Estimating the Impact of Training Data with Reinforcement Learning
In “Data Valuation Using Deep Reinforcement Learning”, accepted at ICML 2020, we address the challenge of quantifying the value of training data using a novel approach based on meta-learning. Our method integrates data valuation into the training procedure of a predictor model that learns to recognize samples that are more valuable for the given task, improving both predictor and data valuation performance. We have also launched four AI Hub Notebooks that exemplify the use cases of DVRL and are designed to be conveniently adapted to other tasks and datasets, such as domain adaptation, corrupted sample discovery and robust learning, transfer learning on image data and data valuation...

Training

Science to Data Science - five weeks of commercial, project-based training to launch your career

S2DS Virtual March 2021 applications are now open. You'll work on a commercial project, receive CV guidance, acquire business acumen, get access to technical mentorship and lifelong career support, and join our community of 750+ data scientists who have graduated from our Bootcamp. Know an aspiring data scientist? We are also running a referral scheme, giving £100 for every successful applicant referred, with no limit on referrals. Just make sure the person you refer puts your name and details in the relevant section of the application form. Applications close 23rd Dec 2020. Apply now.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Product Analyst, Data Scientist - Google - New York, NY

Product Analysts provide quantitative analysis, market expertise and a strategic perspective to our partners throughout the organization. As a data-loving member of the team, you'll serve as an analytics expert for your partners, using numbers to help them make better decisions. You will weave stories with meaningful insight from data. You'll make key recommendations for your fellow Googlers in Engineering and Product Management.
As a Product Analyst, you relish tallying up the numbers one minute and communicating your findings to a team leader the next. You can see different angles of a product or business opportunity, and you know how to connect the dots and interact with people in various roles and functions. You will work to effectively turn business questions into data analysis, and provide meaningful recommendations on strategy....

Training & Resources

The Original Transformer
This repo contains a PyTorch implementation of the original transformer paper (Vaswani et al.)...It's aimed at making it easy to start playing and learning about transformers...You've probably heard of transformers one way or another. GPT-3 and BERT, to name a few well-known ones...The main idea is that they showed that you don't have to use recurrent or convolutional layers and that a simple architecture coupled with attention is super powerful. It gave the benefit of much better long-range dependency modeling and the architecture itself is very parallelizable...which leads to higher compute efficiency!...Here is what their beautifully simple architecture looks like...
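
The attention operation at the heart of that architecture fits in a few lines. The sketch below is ours, written in NumPy rather than taken from the repo's PyTorch code: it shows single-head scaled dot-product attention as described in Vaswani et al., without masking, multiple heads, or learned projections, and the function name is an assumption.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V, weights
```

Every query attends to every key in one matrix product, with no step-by-step recurrence. That is where the parallelism and the direct long-range connections mentioned above come from.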

Don’t Make These 5 Mistakes with SQL
SQL and Machine Learning have a few things in common. It’s easy to start with one as it doesn’t require a lot of coding. Also, code rarely crashes...I would argue that the fact that SQL queries don’t crash makes Data Analysis even harder. How many datasets have I extracted from the database that turned out to have wrong or missing data? Many!...If the code simply crashed, I’d know I screwed it up. Data Scientists need to spend a considerable amount of time on data validation because an SQL query always returns something...These are the 5 mistakes you should avoid when writing SQL queries...
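
To illustrate the "always returns something" point, here is a small sketch using Python's built-in sqlite3 module. It is our own example and not necessarily one of the article's five mistakes; the `users` table and its rows are made up for the demonstration. A `!=` filter silently drops rows where the column is NULL, and the query still succeeds without any warning.

```python
import sqlite3

# Hypothetical in-memory table, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", "US"), ("bo", "DE"), ("cy", None)],
)

# Looks like "everyone outside the US", but NULL != 'US' evaluates to
# NULL, so the row with a missing country is excluded without an error.
naive = conn.execute(
    "SELECT COUNT(*) FROM users WHERE country != 'US'"
).fetchone()[0]

# Handling NULL explicitly keeps the row with the unknown country.
null_aware = conn.execute(
    "SELECT COUNT(*) FROM users WHERE country != 'US' OR country IS NULL"
).fetchone()[0]

print(naive, null_aware)
conn.close()
```

Both queries run cleanly, which is exactly the problem: only a validation step, such as comparing the two counts against the table total, reveals that the first one quietly lost a row.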

Comprehensive Project Based Data Science Curriculum
The curriculum presented here offers a mix of best-in-class resources and a suggested path to complete them in order to become a data scientist. It is intended to be a complete education in data science using online materials and is an alternative to getting a Master's degree. All resources have been heavily researched and used by myself in my journey of becoming a Data Scientist & a Deep Learning practitioner...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
