Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
October 21, 2021

Editor's Picks

  • The philosophical and musical failings of “Beethoven X: The AI Project”
    When VAN asked me to do a review of an artificial-intelligence-created realization of Beethoven’s Tenth Symphony called “Beethoven X: The AI Project,” which is based on the skimpy sketches he left when he died, I more or less groaned in my reply. “Not for me,” I said. “I know pretty much what I’ll think about it, and my review could get snarky.” “If so, that would be all right with us,” VAN said. “Well, OK,” I groaned back. So here I am and here goes...At the end of the symphony I found myself more philosophical than annoyed. I’ll start with that...
  • MIT's "The Missing Semester of Your CS Education" Class
    Classes teach you all about advanced topics within CS, from operating systems to machine learning, but there’s one critical subject that’s rarely covered, and is instead left to students to figure out on their own: proficiency with their tools. We’ll teach you how to master the command-line, use a powerful text editor, use fancy features of version control systems, and much more!...
  • Predicting Spreadsheet Formulas from Semi-structured Contexts
    We describe a new model that learns to automatically generate formulas based on the rich context around a target cell. When a user starts writing a formula with the “=” sign in a target cell, the system generates possible relevant formulas for that cell by learning patterns of formulas in historical spreadsheets....

A Message From This Week's Sponsor

Kickstart Your New Career with a Data Science & Analytics Bootcamp

Don’t miss your chance to join a Data Scientist-led, online Metis bootcamp plus get career support until you’re hired. Bootcamps are starting soon! Ready to take your data science or analytics career to the next level? Learn more about the Metis Online Data Science & Analytics Bootcamps.

Data Science Articles & Videos

  • Explaining in Style: Training a GAN to explain a classifier in StyleSpace
    We propose a training procedure for a StyleGAN, which incorporates the classifier model, in order to learn a classifier-specific StyleSpace. Explanatory attributes are then selected from this space. These can be used to visualize the effect of changing multiple attributes per image, thus providing image-specific explanations. We apply StylEx to multiple domains, including animals, leaves, faces and retinal images. For these, we show how an image can be modified in different ways to change its classifier output. Our results show that the method finds attributes that align well with semantic ones, generate meaningful image-specific explanations, and are human-interpretable as measured in user-studies....
  • Challenges in Detoxifying Language Models
    Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the safety of generated text is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity...We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores for texts generated by models with strong toxicity reduction interventions...
  • ETL Pipelines with Airflow: the Good, the Bad and the Ugly
    In this article, we review how to use Airflow ETL operators to transfer data from Postgres to BigQuery with the ETL and ELT paradigms. Then, we share some challenges you may encounter when attempting to load data incrementally with Airflow DAGs. Finally, we argue why Airflow ETL operators won’t be able to cover the long tail of integrations for your business data...
  • Considerations Before Pushing Machine Learning Models to Production
    As a data scientist, I see every day the challenges that come with putting AI-based solutions in production. These challenges are numerous and cover a variety of aspects: modeling and system design, data engineering, resource management, SLA, etc...I don’t pretend mastery in any of those fields. I do however know that implementing some software engineering principles and using the right tools helped me a lot in making my work reproducible and ready for production...In this article, I’ll share with you 7 of the considerations I have in mind before productionizing my models....
  • Generative art resources in R
    An extremely incomplete (and probably biased) list of resources to help an aspiring generative artist get started making pretty pictures in R...
  • Who is a Data Scientist in 2021?
    Every year we publish a study on 1,001 data scientist profiles. The information is collected from public LinkedIn profiles, assuming that the information posted on the social media platform is an unbiased estimator of their resume...This research allows us to gain insights, with a reasonable degree of certainty, about who is a data scientist in 2021. We present only aggregate data to highlight important trends that can be useful to anyone who wants to break into the field, as well as to organizations looking to hire data scientists....


Create AI-powered search and recommendation apps with Pinecone

Pinecone is a fully managed vector database that makes it easy to add vector search to production applications. It combines state-of-the-art vector search libraries, advanced features such as filtering, and distributed infrastructure to provide high performance and reliability at any scale. Get started now: it's free!

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!


Training & Resources

  • Carnegie Mellon University 10721: Philosophical Foundations of Machine Intelligence
    What is this field? What are its normative aims? What are its modes of inquiry? What are (and have been) its intellectual and ideological commitments? What foundational questions is it in dialogue with, and what foundational obstacles obstruct its progress? Finally: What are our responsibilities as researchers & practitioners deploying this technology?...
  • SHAP: Explain Any Machine Learning Model in Python
    Imagine you are trying to train a machine learning model to predict whether an ad is clicked by a particular person. After receiving some information about a person, the model predicts that a person will not click on an ad...But why does the model predict that? How much does each feature contribute to the prediction? Wouldn’t it be nice if you could see a plot indicating how much each feature contributes to the prediction?...That is when Shapley values come in handy...
  • Random Forests Algorithm explained with a real-life example and some Python code
    Random Forests is a Machine Learning algorithm that tackles one of the biggest problems with Decision Trees: variance...Even though the Decision Trees algorithm is simple and flexible, it is a greedy algorithm. It focuses on optimizing for the node split at hand, rather than taking into account how that split impacts the entire tree. A greedy approach makes Decision Trees run faster, but makes them prone to overfitting...An overfit tree is highly optimized for predicting the values in the training dataset, resulting in a learning model with high variance. How you calculate variance in a Decision Tree depends on the problem you’re solving...
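
The greedy node split mentioned in the Random Forests item can be sketched in a few lines. This is a minimal illustration, not code from the linked article: the function name `best_greedy_split`, the toy data, and the use of weighted variance as the split criterion (a common choice for regression trees) are all assumptions made for the example.

```python
from statistics import pvariance


def best_greedy_split(xs, ys):
    """Return (threshold, weighted_variance) for the best single split on one feature.

    Greedy: only the immediate split is scored; the effect on any
    later splits further down the tree is ignored.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Size-weighted average of the target variance in the two children
        score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        if score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, score)
    return best


# Hypothetical toy data: the target jumps between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, score = best_greedy_split(xs, ys)
print(threshold, score)
```

A full tree applies this search recursively to each child node. A random forest averages many such trees, each grown on a resampled dataset, which is how it reduces the variance of any single greedy tree.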


P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
