Data Science Weekly Newsletter - Issue 399

Issue #399

July 15, 2021

Editor Picks
  • Charting the ‘Data for Good’ Landscape
    The field of “data for good” is not only overshadowed by the public conversations about the risks rampant data misuse can pose to civil society, it is also a fractured and disconnected space...We are taking one tiny step forward in trying to make a more coherent Data for Good space with a landscape that makes clear what various Data for Good initiatives (and AI for Good initiatives) are trying to achieve, how they do it, and what makes them similar or different from one another...Below you will find a very preliminary landscape map, along with a description of the different kinds of groups in the Data for Good ecosystem and why you might need to engage with them...
  • Postmortem: A Year of Data Science Peer Review in Startups
    About a year ago I suggested two peer review processes for data science projects, outlined a structure for the process — including separate review of the research phase and the model design and implementation — and positioned it within the wider scope of a data science project flow (as it is practiced in startup companies). The framework also included a list of topics, pitfalls and questions that should be reviewed...Unsurprisingly, things did not go exactly as planned. Thus, this post is about what worked and what didn’t...
  • Amazon's Data Dragnet
    Amazon’s vast offerings—which include online as well as brick-and-mortar stores, an internet pharmacy, streaming services, smart speakers, security cameras, mobile apps, and a digital advertising network—have helped the company expand into nearly every aspect of people’s lives...the Tech Transparency Project (TTP) conducted a review of the company’s privacy policies, patent applications, and other open-source information to assess the full scope of its surveillance capabilities. The findings show that Amazon is collecting far more data about its users than many people realize...

A Message from this week's Sponsor:


Data Science & Analytics Bootcamps to Fit Your Schedule

Join an Online Flex Data Science & Analytics Bootcamp where you can work on your own schedule with on-demand lectures, while still getting dedicated 1:1 instructor support. You’ll also get focused career support until you’re hired.

Ready to start your journey?

Learn more about the Metis Online Flex Data Science & Analytics Bootcamps



Data Science Articles & Videos

  • Identifying Document Types at Scribd
    This post walks through how we built a computer vision model to identify and classify over 100 million user-uploaded documents at Scribd. We talk about challenges faced such as defining classes, gathering data and dealing with overconfident predictions in production. ...
  • Getting Started With Facebook AI Similarity Search (FAISS)
    Facebook AI Similarity Search (FAISS) is one of the most popular implementations of efficient similarity search, but what is it, and how can we use it?...What is it that makes Faiss special? How do we make the best use of this incredible tool?...Fortunately, it’s a brilliantly simple process to get started with. And in this article, we’ll explore some of the options FAISS provides, how they work, and, most importantly, how Faiss can make our search faster...
  • What are Diffusion Models?
    Diffusion models are a new type of generative model that is flexible enough to learn any arbitrarily complex data distribution while remaining tractable enough to evaluate that distribution analytically. It has been shown recently that diffusion models can generate high-quality images, with performance competitive with SOTA GANs...
  • A Systematic Survey of Text Worlds as Embodied Natural Language Environments
    Text Worlds are virtual environments for embodied agents that, unlike 2D or 3D environments, are rendered exclusively using textual descriptions...This systematic survey outlines recent developments in tooling, environments, and agent modeling for Text Worlds, while examining recent trends in knowledge graphs, common sense reasoning, transfer learning of Text World performance to higher-fidelity environments, as well as near-term development targets that, once achieved, make Text Worlds an attractive general research paradigm for natural language processing. ...
  • Overview of process automation engagement trends in the Fortune 1000
    In this post, we take a look at process automation engagement trends in Fortune 1000 companies. We will examine three main categories of process automation: Business Process Automation (BPA), Robotic Process Automation (RPA) and Intelligent Process Automation (IPA), and review how the most-engaged companies are implementing automation technologies...
  • AI-Generating Algorithms: A Unique Opportunity for the Evolutionary RL Community [Video]
    A clear trend in machine learning is that hand-designed pipelines are replaced by higher-performing learned pipelines once sufficient compute and data are available. I argue that trend will apply to machine learning itself...In this talk I explain why I think AI-GAs are not only the fastest path to general AI, but also that they offer a tremendous and unique opportunity to the evolutionary RL community (including researchers focused on Quality Diversity and Open-Endedness). I’ll describe examples of successfully combining ideas from evolution and mainstream RL, including our Go-Explore and POET algorithms. I’ll also highlight many future research directions our community is uniquely poised to pursue to capitalize on this historic opportunity...
  • Multi-Task Learning with Deep Neural Networks: A Survey
    Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model...In this survey, we give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field...
  • Is MLP-Mixer a CNN in Disguise?
    Recently, a new kind of architecture - MLP-Mixer: An all-MLP Architecture for Vision (Tolstikhin et al, 2021) - was proposed which claims to have competitive performance with SOTA models on ImageNet without using convolutions or attention. But is this really true? Are the token-mixing or channel-mixing layers in the MLP Mixer architecture actually "Conv-free"? (Figure-1) The deep learning community is split on this idea...
  • The Quick and Dirty Guide to Building Your Data Platform
    For most organizations, building a data platform is no longer a nice-to-have but a need-to-have, with many companies distinguishing themselves from the competition based on their ability to glean actionable insights from their data...To make things a little easier, we’ve outlined the 6 must-have layers you need to include in your data platform and the order in which many of the best teams choose to implement them...
  • My Journey to Deep Learning - [Video] by Jeremy Howard
    For 20 years I used a wide variety of machine learning and optimization algorithms to tackle predictive modeling challenges in many fields. But today, I find that deep learning gives me the best results for most problems I tackle, including solving problems that previously were out of reach. Furthermore, I find that deep learning generally requires less manual tweaking, leading to fewer errors and quicker results. Here, I discuss what I've learned on this journey, and describe why I believe nearly all data scientists should invest heavily in becoming effective deep learning practitioners...
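
The "Is MLP-Mixer a CNN in Disguise?" item above asks whether the channel-mixing layers are really conv-free. As a rough illustration of one side of that debate (a minimal NumPy sketch, not code from the linked post or the MLP-Mixer paper; all names and shapes below are made up for the example), a dense layer applied to each token independently gives the same numbers as a convolution whose kernel covers a single token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_tokens, channels_in, channels_out = 4, 3, 5
x = rng.normal(size=(num_tokens, channels_in))    # one row per token (patch)
w = rng.normal(size=(channels_in, channels_out))  # shared mixing weights

# Channel mixing written as a dense layer applied to every token.
dense_out = x @ w


def conv1d_kernel_size_1(inputs, weights):
    """Convolution over the token axis with a window of width 1.

    Each output position only sees the input channels at the same token,
    which is exactly what the per-token dense layer does.
    """
    out = np.zeros((inputs.shape[0], weights.shape[1]))
    for t in range(inputs.shape[0]):          # slide over tokens
        for o in range(weights.shape[1]):     # one filter per output channel
            out[t, o] = np.sum(inputs[t, :] * weights[:, o])
    return out


conv_out = conv1d_kernel_size_1(x, w)
same = np.allclose(dense_out, conv_out)
print("dense layer matches kernel-size-1 convolution:", same)
```

This only covers the linear part of channel mixing; whether that equivalence makes the full architecture "a CNN" is the question the linked post discusses.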




Build Smart Data Pipelines with StreamSets Summer ‘21 Beta

Run your first smart data pipeline in StreamSets. It’s easy and completely free. Quickly build and deploy streaming, batch, CDC, ETL and ML pipelines. Handle data drift automatically so you can keep jobs running even when schemas and structures change. Deploy across hybrid and multi-cloud platforms.

Imagine less hands-on maintenance and reliable scalability so you can focus on responding to business requests and needs as quickly as possible.

Learn more about StreamSets Summer ‘21 Beta.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Senior Data Scientist - WarnerMedia - New York, NY

    WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

    Reporting to the Sr. Manager, Data Science, this role will help to develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation...

        Want to post a job here? Email us for details >>


Training & Resources

  • Random Matrix Theory and Machine Learning (ICML 2021 Tutorial)
    In recent years, random matrix theory (RMT) has come to the forefront of learning theory as a tool to understand some of its most important challenges. From generalization of deep learning models to a precise analysis of optimization algorithms, RMT provides analytically tractable models...
  • Gaussian Processes from Scratch
    This post explores some concepts behind Gaussian processes, such as stochastic processes and the kernel function. We will build up a deeper understanding of Gaussian process regression by implementing it from scratch using Python and NumPy...
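
For readers who want a feel for what "from scratch" involves before clicking through, here is a minimal sketch of Gaussian process regression in NumPy. It is not the linked post's code: the RBF kernel, the hyperparameter values, the function names, and the toy data are all assumptions picked for illustration.

```python
import numpy as np


def rbf_kernel(xa, xb, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel between two sets of 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)

    # A Cholesky factorization avoids forming an explicit matrix inverse.
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha

    v = np.linalg.solve(chol, k_ts)
    cov = k_ss - v.T @ v
    return mean, cov


# Toy data: three noise-free observations of sin(x).
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3, 3, 7)  # includes the training inputs -2 and 0

mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
for x, m, s in zip(x_test, mean, std):
    print(f"x={x:+.1f}  mean={m:+.3f}  std={s:.3f}")
```

The posterior mean passes through the observed points, and the predicted standard deviation shrinks toward zero there and grows away from the data, which is the behavior the regression is meant to show.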




Books

  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian