Data Science Weekly Newsletter

Issue 397

July 1, 2021

Editor's Picks

The Batch Weekly Newsletter
Welcome to The Batch, a new weekly newsletter from deeplearning.ai! The Batch presents the most important AI events and perspective in a curated, easy-to-read report for engineers and business leaders. Every Wednesday, The Batch highlights a mix of the most practical research papers, industry-shaping applications, and high-impact business news...

Humans of AI: Stories, Not Stats - Podcast
In this series, I interview AI researchers to get to know them better as people...Our interaction with prominent AI researchers tends to be through the lens of their work. I believe it is valuable to see the human behind the work...I will not ask them any questions about their work or AI or technology. I also won’t ask them questions about the “stats” of their life like where they went to college...I will ask questions to try and understand who they are as a person, what their life is like, what they think about, what they are insecure about, what they get excited about. Questions that reveal the story of their day-to-day life...

Dynamic Data Testing: tests that learn with data
When testing data, our first instinct is to reach for perfection. Can’t we write down a clear set of rules that govern exactly how our data should behave, just like we do when testing software?...Of course we can’t! Data isn’t software, and shouldn’t be tested in the same way...To test data effectively we need tests that adapt...In this post, we outline a framework for data testing, from static tests that can be written in SQL, to dynamic tests that require statistics or machine learning. Then we compare both approaches with an example from COVID-19 data in the EU...
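To make the static/dynamic distinction concrete, here is a minimal Python sketch. It is not taken from the linked post: the function names, the fixed bounds, the z-score rule, and the daily counts are all made up for illustration. It assumes a static test is a fixed rule and a dynamic test derives its accepted range from recent history.

```python
from statistics import mean, stdev


def static_test(value, lower=0, upper=10_000):
    """Static test: a fixed rule written down once (bounds are hypothetical)."""
    return lower <= value <= upper


def dynamic_test(value, history, z_threshold=3.0):
    """Dynamic test: the accepted range is learned from recent history.

    Uses a simple z-score rule as a stand-in for the statistical or
    ML-based checks the blurb mentions.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation observed so far: only the exact historical value passes.
        return value == mu
    return abs(value - mu) / sigma <= z_threshold


# Made-up daily counts, for illustration only.
history = [980, 1010, 995, 1020, 1005, 990, 1000]

print(static_test(5000))            # inside the fixed bounds, so the static test passes
print(dynamic_test(5000, history))  # far from recent history, so the dynamic test fails
print(dynamic_test(1015, history))  # close to recent history, so the dynamic test passes
```

The value 5000 satisfies the hand-written bounds yet is far outside what the recent data looks like, which is the kind of case an adaptive test is meant to catch.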

A Message From This Week's Sponsor

Data scientists are in demand on Vettery

Vettery is an online platform that connects you with thousands of actively hiring startups and Fortune 500 companies. Create a free profile, name your salary, and get discovered by hiring managers looking to grow their teams.
Get started - it’s completely free for job-seekers!

Data Science Articles & Videos

How ZSL uses ML to classify gunshots to protect wildlife
The analysis of acoustic (sound) data to support wildlife conservation is one of the major lines of work at the monitoring and technology programme of ZSL (Zoological Society of London), an international conservation charity. Compared to camera traps that are limited to detection at close range, acoustic sensors can detect events up to 1 kilometre (about half a mile) away. This has the potential to enable conservationists to track wildlife behaviour and threats over much greater areas...

Using GANs to Create Fantastical Creatures
Today, we present Chimera Painter, a trained machine learning (ML) model that automatically creates a fully fleshed out rendering from a user-supplied creature outline. Employed as a demo application, Chimera Painter adds features and textures to a creature outline segmented with body part labels, such as “wings” or “claws”, when the user clicks the “transform” button. Below is an example using the demo with one of the preset creature outlines...

FSD50K: an Open Dataset of Human-Labeled Sound Events
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset...To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms)...

gradslam - an open-source framework for simultaneous localization and mapping (SLAM) systems
gradslam is a fully differentiable dense SLAM framework. It provides a repository of differentiable building blocks for a dense SLAM system, such as differentiable nonlinear least squares solvers, differentiable ICP (iterative closest point) techniques, differentiable raycasting modules, and differentiable mapping/fusion blocks. One can use these blocks to construct SLAM systems that allow gradients to flow all the way from the outputs of the system (map, trajectory) to the inputs (raw color/depth images, parameters, calibration, etc.)...Specifically, we implement differentiable versions of three classical dense SLAM systems using the gradslam framework: KinectFusion, PointFusion, ICP-SLAM...

Beyond CUDA: GPU Accelerated Python for Machine Learning on Cross-Vendor Graphics Cards Made Simple
In this article you’ll learn how to write your own GPU accelerated algorithms in Python, which you will be able to run on virtually any GPU hardware — including non-NVIDIA GPUs. We’ll introduce core concepts and show how you can get started with the Kompute Python framework with only a handful of lines of code...

Towards ML Engineering: A Brief History Of TensorFlow Extended
ML Engineering, as a discipline, has not widely matured as much as its Software Engineering ancestor. Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering? In this article we will give a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering...

Isolation Forest is the best Anomaly Detection Algorithm for Big Data Right Now
Isolation forest or “iForest” is an astoundingly beautiful and elegantly simple algorithm that identifies anomalies with few parameters. The original paper is accessible to a broad audience and contains minimal math. In this article, I will explain why iForest is the best anomaly detection algorithm for big data right now, provide a summary of the algorithm, history of the algorithm and share a code implementation...

DeepMind Lab2D: A learning environment for the creation of grid worlds
DeepMind Lab2D is a system for the creation of 2D environments for machine learning. The main goals of the system are ease of use and performance: The environments are "grid worlds", which are defined with a combination of simple text-based maps for the layout of the world, and Lua code for its behaviour. Machine learning agents interact with these environments through one of two APIs, the Python dm_env API or a custom C API (which is also used by DeepMind Lab). Multiple agents are supported...

Launch HN: Replicate (YC W20) – Version control for machine learning
Replicate is a lightweight open-source tool for tracking and analyzing your machine learning experiments...We spent a year talking to lots of people in the ML community and building all sorts of prototypes, but we kept on coming back to a foundational problem: not many people in machine learning use version control...This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced...We came to the conclusion that we need a native version control system for ML. It’s sufficiently different to normal software that we can’t just put band-aids on Git...Replicate...is a Python library that uploads your files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get back to any point in time using the command-line interface, analyze your results inside a notebook using the Python API, and load your models in production systems...

Training

Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:

Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Product Analyst, Data Scientist - Google - New York, NY

Product Analysts provide quantitative analysis, market expertise and a strategic perspective to our partners throughout the organization. As a data-loving member of the team, you'll serve as an analytics expert for your partners, using numbers to help them make better decisions. You will weave stories with meaningful insight from data. You'll make key recommendations for your fellow Googlers in Engineering and Product Management.
As a Product Analyst, you relish tallying up the numbers one minute and communicating your findings to a team leader the next. You can see different angles of a product or business opportunity, and you know how to connect the dots and interact with people in various roles and functions. You will work to effectively turn business questions into data analysis, and provide meaningful recommendations on strategy....

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Machine learning resources from the "End-to-End Machine Learning School"
If you want to study machine learning but don't have the luxury of attending university full-time, you're in luck. There is a wonderfully rich collection of courses, posts, videos, notebooks, and tutorials online. There is so much, in fact, that it can be hard to know where to start. I put together this guide as a starting place, a first foothold for anyone who wants to jump in...

130 Machine Learning Projects Solved and Explained
Practice your data science skills in Python by learning and then trying all of these hands-on, interactive projects that I have posted for you. Working through them will show you the practical environment in which you follow instructions in real time...

Charles proxy for web scraping
Charles proxy is an HTTP debugging proxy that can inspect network calls and debug SSL traffic. With Charles, you are able to inspect requests/responses, headers and cookies. Today we will see how to set up Charles, and how we can use Charles proxy for web scraping. We will focus on extracting data from JavaScript-heavy web pages and mobile applications...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
