Data Science Weekly Newsletter - Issue 397

Nov 19 2020

Editor Picks
  • The Batch Weekly Newsletter
    Welcome to The Batch, a new weekly newsletter! The Batch presents the most important AI events and perspectives in a curated, easy-to-read report for engineers and business leaders. Every Wednesday, The Batch highlights a mix of the most practical research papers, industry-shaping applications, and high-impact business news...
  • Humans of AI: Stories, Not Stats - Podcast
    In this series, I interview AI researchers to get to know them better as people...Our interaction with prominent AI researchers tends to be through the lens of their work. I believe it is valuable to see the human behind the work...I will not ask them any questions about their work or AI or technology. I also won’t ask them questions about the “stats” of their life like where they went to college...I will ask questions to try and understand who they are as a person, what their life is like, what they think about, what they are insecure about, what they get excited about. Questions that reveal the story of their day-to-day life...
  • Dynamic Data Testing: tests that learn with data
    When testing data, our first instinct is to reach for perfection. Can’t we write down a clear set of rules that govern exactly how our data should behave, just like we do when testing software?...Of course we can’t! Data isn’t software, and shouldn’t be tested in the same way...To test data effectively we need tests that adapt...In this post, we outline a framework for data testing, from static tests that can be written in SQL, to dynamic tests that require statistics or machine learning. Then we compare both approaches with an example from COVID-19 data in the EU...
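The static-versus-dynamic contrast the post draws can be sketched in a few lines. This is a toy illustration of the idea, not the authors' code: the static test hard-codes its rule up front (like a SQL CHECK constraint), while the dynamic test derives its bounds from the data's recent history.

```python
import statistics

daily_case_counts = [102, 98, 110, 105, 99, 101, 97, 500]  # last value is suspect

# Static test: a fixed rule written once, independent of the data seen so far.
def static_test(value):
    return 0 <= value <= 1000  # passes even for the suspicious spike

# Dynamic test: bounds adapt to the data (here, mean +/- k standard deviations
# of the trailing window), so the same spike now fails.
def dynamic_test(history, value, k=3.0):
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(value - mu) <= k * sigma

history, latest = daily_case_counts[:-1], daily_case_counts[-1]
print(static_test(latest))            # True  - the fixed rule misses the spike
print(dynamic_test(history, latest))  # False - the learned bounds catch it
```

A real implementation would use a rolling window and account for trend and seasonality, as the post discusses, but the shape of the idea is the same.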

A Message from this week's Sponsor:


Data scientists are in demand on Vettery

Vettery is an online platform that connects you with thousands of actively hiring startups and Fortune 500 companies. Create a free profile, name your salary, and get discovered by hiring managers looking to grow their teams.

Get started - it’s completely free for job-seekers!


Data Science Articles & Videos

  • How ZSL uses ML to classify gunshots to protect wildlife
    The analysis of acoustic (sound) data to support wildlife conservation is one of the major lines of work at the monitoring and technology programme of ZSL (the Zoological Society of London, an international conservation charity). Compared to camera traps that are limited to detection at close range, acoustic sensors can detect events up to 1 kilometre (about half a mile) away. This has the potential to enable conservationists to track wildlife behaviour and threats over much greater areas...
  • Using GANs to Create Fantastical Creatures
    Today, we present Chimera Painter, a trained machine learning (ML) model that automatically creates a fully fleshed out rendering from a user-supplied creature outline. Employed as a demo application, Chimera Painter adds features and textures to a creature outline segmented with body part labels, such as “wings” or “claws”, when the user clicks the “transform” button. Below is an example using the demo with one of the preset creature outlines...
  • FSD50K: an Open Dataset of Human-Labeled Sound Events
    Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset...To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms)...
  • gradslam - an open-source framework for simultaneous localization and mapping (SLAM) systems
    gradslam is a fully differentiable dense SLAM framework. It provides a repository of differentiable building blocks for a dense SLAM system, such as differentiable nonlinear least squares solvers, differentiable ICP (iterative closest point) techniques, differentiable raycasting modules, and differentiable mapping/fusion blocks. One can use these blocks to construct SLAM systems that allow gradients to flow all the way from the outputs of the system (map, trajectory) to the inputs (raw color/depth images, parameters, calibration, etc.)...Specifically, we implement differentiable versions of three classical dense SLAM systems using the gradslam framework: KinectFusion, PointFusion, ICP-SLAM...
  • Towards ML Engineering: A Brief History Of TensorFlow Extended
    ML Engineering, as a discipline, has not widely matured as much as its Software Engineering ancestor. Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering? In this article we will give a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering...
  • Isolation Forest is the best Anomaly Detection Algorithm for Big Data Right Now
    Isolation forest or “iForest” is an astoundingly beautiful and elegantly simple algorithm that identifies anomalies with few parameters. The original paper is accessible to a broad audience and contains minimal math. In this article, I will explain why iForest is the best anomaly detection algorithm for big data right now, provide a summary of the algorithm, history of the algorithm and share a code implementation...
  • DeepMind Lab2D: A learning environment for the creation of grid worlds
    DeepMind Lab2D is a system for the creation of 2D environments for machine learning. The main goals of the system are ease of use and performance: The environments are "grid worlds", which are defined with a combination of simple text-based maps for the layout of the world, and Lua code for its behaviour. Machine learning agents interact with these environments through one of two APIs, the Python dm_env API or a custom C API (which is also used by DeepMind Lab). Multiple agents are supported...
  • Launch HN: Replicate (YC W20) – Version control for machine learning
    Replicate is a lightweight open-source tool for tracking and analyzing your machine learning experiments...We spent a year talking to lots of people in the ML community and building all sorts of prototypes, but we kept on coming back to a foundational problem: not many people in machine learning use version control...This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced...We came to the conclusion that we need a native version control system for ML. It’s sufficiently different from normal software that we can’t just put band-aids on existing tools. Replicate is a Python library that uploads your files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get back to any point in time using the command-line interface, analyze your results inside a notebook using the Python API, and load your models in production systems...
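To make the isolation-forest item above concrete, here is a minimal sketch using scikit-learn's IsolationForest (my own example, not the article's code). The algorithm needs only a handful of parameters, as the blurb notes: points that are isolated in few random splits get flagged as anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Normal points clustered around the origin, plus a few obvious outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X = np.vstack([normal, outliers])

# Few parameters: number of trees and the expected contamination rate.
clf = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = clf.fit_predict(X)  # +1 = inlier, -1 = anomaly

print("anomalies flagged:", int((labels == -1).sum()))
```

Because anomalies are "few and different", they are separated from the bulk of the data after only a few random axis-aligned splits, which is why the method scales well to big data compared with density- or distance-based detectors.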



Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.

The course is broken down into three guides:
  1. Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

  2. Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

  3. Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
Click here to learn more ...

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Product Analyst, Data Scientist - Google - New York, NY

    Product Analysts provide quantitative analysis, market expertise and a strategic perspective to our partners throughout the organization. As a data-loving member of the team, you'll serve as an analytics expert for your partners, using numbers to help them make better decisions. You will weave stories with meaningful insight from data. You'll make key recommendations for your fellow Googlers in Engineering and Product Management.

    As a Product Analyst, you relish tallying up the numbers one minute and communicating your findings to a team leader the next. You can see different angles of a product or business opportunity, and you know how to connect the dots and interact with people in various roles and functions. You will work to effectively turn business questions into data analysis, and provide meaningful recommendations on strategy...

        Want to post a job here? Email us for details >>


Training & Resources

  • Machine learning resources from the "End-to-End Machine Learning School"
    If you want to study machine learning but don't have the luxury of attending university full-time, you're in luck. There is a wonderfully rich collection of courses, posts, videos, notebooks, and tutorials online. There is so much, in fact, that it can be hard to know where to start. I put together this guide as a starting place, a first foothold for anyone who wants to jump in...
  • 130 Machine Learning Projects Solved and Explained
    Practice your Data Science skills with these hands-on, interactive Python projects that I have posted for you. By learning from and working through these projects, you will gain a feel for the practical, real-world side of data science...
  • Charles proxy for web scraping
    Charles proxy is an HTTP debugging proxy that can inspect network calls and debug SSL traffic. With Charles, you are able to inspect requests/responses, headers and cookies. Today we will see how to set up Charles, and how we can use Charles proxy for web scraping. We will focus on extracting data from Javascript-heavy web pages and mobile applications...



  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S. Enjoying the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
Sign up to receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe. No spam — we keep your email safe and do not share it.