Data Science Weekly Newsletter - Issue 406

Issue #374

Jan 21 2021

Editor Picks
  • Controlled Experiments - Why Bother?
    I spent some time earlier this year orchestrating a massive experiment for Firefox. We launched a bunch of new features with Firefox 80 and we wanted to understand whether these new features improved our metrics...In the process, I ended up talking with a bunch of Firefox engineers and explaining why we need to run a controlled experiment. There were a few questions that got repeated a lot, so I figure it's worth answering them here...This article is targeted at new data scientists or engineers interested in data...
  • Deep Learning in the Sciences
    In this episode of the Data Exchange Podcast I speak with Bharath (“Bart”) Ramsundar, author and open source developer. While in graduate school, Bart created DeepChem, an open source project that aims to democratize deep learning for science. DeepChem historically was developed for researchers in the life sciences, so the working examples in its tutorials draw from areas like chemistry and bioinformatics...
  • AI in Drug Discovery 2020 - A Highly Opinionated Literature Review
    In this post, I present an annotated bibliography of some of the interesting machine learning papers I read in 2020...This list reflects a few interesting trends I saw this year...a) More of a practical focus on active learning, b) Efforts to address model uncertainty, as well as the admission that it's a very difficult problem, c) The (re)emergence of molecular representations that incorporate 3D structure, d) Several interesting strategies for data augmentation, e) Additional efforts toward model interpretability, coupled with the acknowledgment that this is also a difficult problem, and f) The application of generative models to more practical objectives (e.g. not LogP and QED)...

A Message from this week's Sponsor:


New Year, New Career

Jumpstart Your Career When You Apply to TDI’s Spring Data Science Fellowship Program

With The Data Incubator’s data science fellowship program, you’ll work closely with our expert instructors to master the in-demand data skills and programs you need to conquer the business world.

Our career service team will help you land a great job in data. And with our income sharing agreements, you won’t pay a cent in tuition until you get that job.

Attend full-time or part-time. Applications close on February 12.
Apply Now.


Data Science Articles & Videos

  • Machine Learning Models are Missing Contracts
    Why pretrained machine learning models are often unusable and irreproducible — and what we can do about it...A useful approach to designing software is through contracts. For every function in your codebase, you start by writing its contract: clearly specifying what inputs are expected and valid for that function (the precondition), and what the function will do (the postcondition) when provided an appropriate input...
  • Making sense of sensory input
    This paper attempts to answer a central question in unsupervised learning: what does it mean to “make sense” of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the causal theory – objects, properties, and laws – must be integrated into a coherent whole. On our account, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis...
  • Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation
    We focus on studying the discrepancy of neural networks during the training process that has arisen purely from randomizations. We ask the following questions: besides this small deviation in test accuracies, do the neural networks trained from different random initializations actually learn very different functions? If so, where does the discrepancy come from? How do we reduce such discrepancy and make the neural network more stable or even better? These questions turn out to be quite nontrivial, and they relate to the mysteries of three techniques widely used in deep learning...
  • Predicting drive failure & an introduction to machine learning
    We’ve all had a hard drive fail on us, and often it’s as sudden as booting your machine and realizing you can’t access a bunch of your files. It’s not a fun experience. It’s especially not fun when you have an entire data center full of drives that are all important to keeping your business running. What if we could predict when one of those drives would fail, and get ahead of it by preemptively replacing the hardware before the data is lost? This is where the history of predictive drive failure...begins...
  • ZeRO-Offload: Democratizing Billion-Scale Model Training
    Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to CPU...
  • How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired
    When Facebook users scroll through their News Feed, they find all kinds of content...many users who are blind or visually impaired (BVI) can also experience that imagery, provided it’s tagged properly with alternative text...Unfortunately, many photos are posted without alt text, so in 2016 we introduced a new technology called automatic alternative text (AAT)...The latest iteration of AAT ...makes it possible to include information about the positional location and relative size of elements in a photo. So instead of describing the contents of a photo as “May be an image of 5 people,” we can specify that there are two people in the center of the photo and three others scattered toward the fringes, implying that the two in the center are the focus...
  • A retrospective of NeurIPS 2020
    An incredible 23,000 people virtually attended the 2020 Conference on Neural Information Processing Systems, a highly regarded machine learning conference. Here you will find my personal, quite random, and definitely incomplete retrospective. Some of my favourite topics included model understanding, model compression, training bag of tricks, self-supervised learning for audio, a walk through the world of BERT, and indigenous in AI...
  • Prostate Cancer can be precisely diagnosed using a urine test with artificial intelligence
    Prostate cancer is one of the most common cancers among men. Patients are determined to have prostate cancer primarily based on PSA, a cancer factor in blood. However, as diagnostic accuracy is as low as 30%, a considerable number of patients undergo additional invasive biopsy and thus suffer from resultant side effects, such as bleeding and pain...The Korea Institute of Science and Technology (KIST) announced that the collaborative research...for diagnosing prostate cancer from urine within only 20 minutes with almost 100% accuracy. The research team developed this technique by introducing a smart AI analysis method to an electrical-signal-based ultrasensitive biosensor...
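
To make the contract idea from "Machine Learning Models are Missing Contracts" a little more concrete, here is a minimal Python sketch of a function with an explicit precondition and postcondition. The function name `normalize_pixels` and the 8-bit input range are hypothetical, chosen only for illustration; they are not taken from the article.

```python
def normalize_pixels(pixels):
    """Scale 8-bit pixel values into the range [0.0, 1.0].

    Hypothetical example of a preprocessing step with a written contract.
    """
    # Precondition: a non-empty sequence of values in [0, 255].
    if not pixels:
        raise ValueError("pixels must be non-empty")
    if any(not (0 <= p <= 255) for p in pixels):
        raise ValueError("every pixel must be in the range [0, 255]")

    scaled = [p / 255.0 for p in pixels]

    # Postcondition: same length as the input, every value in [0.0, 1.0].
    assert len(scaled) == len(pixels)
    assert all(0.0 <= v <= 1.0 for v in scaled)
    return scaled
```

A caller who passes raw values outside the stated range gets an immediate error instead of silently wrong predictions, which is the kind of failure a written contract is meant to surface.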

Data Platform*



We’re tired of seeing data scientists not getting paid for collecting data!

DoltHub is a platform for data collaboration that wants to pay you to source data. We recently launched a $10,000 bounty to collect the best open dataset of hospital prices! Get paid for every row you contribute: submit 20% of the dataset, get a $2,000 reward.

DoltHub makes data collaboration easy. Dolt databases can be forked, cloned, and merged just like Git repositories. That means multiple people can work on the same dataset without stomping on each other's changes.

Please see the link to the bounty here or join our Discord here.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Data Scientist - Apple Pay Analytics - NYC

    You will play a key role improving the Apple Pay product experience. As a member of the analytics team you will be supporting a product function. You will partner with business owners, understand goals, craft KPIs and measure ongoing performance. You will initially engage with the product and engineering teams in ensuring that we have the appropriate instrumentation in place to deliver on these metrics. You will subsequently use advanced statistical, ML and analytical techniques to analyze product performance and identify key insights that inform product improvements and business strategy. The role requires a high degree of independence, ownership and collaboration working cross functionally across all levels of a highly matrixed organization...

        Want to post a job here? Email us for details >>


Training & Resources

  • ML Theory with bad drawings
    This semester I am teaching a seminar on the theory of machine learning. For the first lecture, I would like to talk about what is the theory of machine learning. I decided to write this (very rough!) blog post mainly to organize my own thoughts...
  • Book Review: Deep Learning With PyTorch
    After its release in August 2020, Deep Learning with PyTorch has been sitting on my shelf before I finally got a chance to read it during this winter break. It turned out to be the perfect easy-going reading material for a bit of productivity after the relaxing holidays. As promised last week, here are my thoughts...
  • SVM Classifier and RBF Kernel — How to Make Better Models in Python
    A complete explanation of the inner workings of Support Vector Machines (SVM) and Radial Basis Function (RBF) kernel...The story covers the following topics: a) The category of algorithms that SVM classification belongs to, b) An explanation of how the algorithm works, c) What are kernels, and how are they used in SVM?, and d) A closer look into RBF kernel with Python examples and graphs...
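
For readers who want a feel for the RBF kernel mentioned in the SVM item above, here is a small sketch in plain Python using the standard definition K(x, y) = exp(-gamma * ||x - y||^2). The function name `rbf_kernel` and the default `gamma=1.0` are assumptions made for this sketch, not code from the linked story.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Radial Basis Function kernel between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Squared Euclidean distance between the two points.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    # Similarity decays from 1.0 toward 0.0 as the points move apart.
    return math.exp(-gamma * sq_dist)
```

Identical points score 1.0, and a larger `gamma` makes the similarity fall off faster with distance, which is why that parameter controls how tightly an SVM's decision boundary follows individual training points.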



Books

  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian