Data Science Weekly Newsletter - Issue 422

Issue #389

May 06 2021

Editor Picks
  • The San Pellegrino label Moiré effect
    Have you noticed the nice Moiré effect in the San Pellegrino label?...Notice the wavy pattern which repeats itself. If you look closely it looks like it is obtained from the periodic repetition of a wavy curve and a line along two different directions. Hence the beating effect which corresponds to the fact that the two curves are out of phase (appears darker because more ink density) or in phase (appear lighter because the two curves are one upon the other)...In this notebook, we are going to reverse engineer how this wavy pattern is being obtained...
  • Hard choices: AI in health care
    Artificial intelligence will change the health care industry, not least by raising serious moral issues...Two of the most pressing current ethical considerations involve the potential loss of physician autonomy and the unconscious amplification of underlying biases...
  • Introducing Observable Plot
    We're excited to announce Observable Plot, a new open-source library for faster and easier data exploration on the web!...Plot's concise API and thoughtful defaults are designed for a more joyful visualization process...Plot is informed by ten years of maintaining D3 but does not replace it. We continue to support and develop D3, and recommend its low-level approach for bespoke explanatory visualizations and as a foundation for higher-level exploratory visualization tools. In fact, Plot is built on D3! Observable Plot is more akin to Vega-Lite, another great tool for exploration...

A Message from this week's Sponsor:


Online Data Science Programs from Drexel University

Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career.

Learn more.


Data Science Articles & Videos

  • Practical SQL for Data Analysis: What you can do without Pandas
    Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job...SQL databases have been around since the 1970s. Some of the smartest people in the world worked on making it easy to slice, dice, fetch and manipulate data quickly and efficiently. SQL databases have come such a long way that many developers and data scientists have lost track of what they can do with the database they already have!...In this article I demonstrate how to use SQL to perform fast and efficient data analysis...
  • L'art pour l'art: creating generative art with L-systems in Python
    "Would it be possible to create a Python project that can generate those pencil-like drawings with random forests, with different types of trees?", I thought (pun intended). Of course, someone already thought of that and it even has a name: algorithmic botany...Although a lot has been written on the subject and there are quite a few open source libraries, they weren't quite what I had in mind...Instead, the idea of creating generative art in Python emerged using the following components: a) A lightweight implementation of L-systems that also supports stochastic and parametric production rules and b) Integration with p5js via pyp5js as a web-native graphing engine...In this blog post, I will lay out the ideas behind this approach...
  • Is Natural Language Processing Ready to Take on Legal Hearings?
    Every year, California holds thousands of parole hearings for eligible prisoners...In each of those hearings, a 150-page transcript of the entire conversation is produced for the government and public to review. And most likely, that transcript will never be read...Machine learning opens the opportunity to devise a new approach: What if we could “read” thousands of hearing transcripts within minutes, writing out the most important factors for each case?...This approach would center on human discretionary judgment and use technology to ensure transparency and consistency...We call this the “Recon Approach” and believe it has applications well beyond parole...
  • Synthetic Data Generation Using Gaussian Mixture Model
    At a conceptual level, synthetic data is not real data, but data that has been generated from real data and that has the same statistical properties as the real data. This means that if an analyst works with a synthetic dataset, they should get analysis results similar to what they would get with real data...In this notebook, we first look at the differences between KMeans and GMM (Gaussian Mixture Model) as clustering algorithms, and then see the power of GMM when used as a density estimator model...
  • The art of solving problems with Monte Carlo simulations
    This article will explore some examples and applications of Monte Carlo simulations using the Go programming language. To keep this article fun and interactive, after each Go code provided, you will find a link to the Go Playground, where you can run it without installing Go on your machine...Put your adventure helmets on!...
  • Do Wide and Deep Networks Learn the Same Things?
    In “Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth”, we perform a systematic study of the similarity between wide and deep networks from the same architectural family through the lens of their hidden representations and final outputs. In very wide or very deep models, we find a characteristic block structure in their internal representations, and establish a connection between this phenomenon and model overparameterization...
  • Case Study: How Your Course Can Incorporate the Reproducibility Challenge
    The Machine Learning Reproducibility Challenge (MLRC) is an event hosted by Papers with Code designed to encourage the publishing and sharing of reproducible scientific results in machine learning (ML)...The University of Amsterdam incorporated the MLRC into a graduate level course for students in the Master AI study program...All in all, this was a great experience both for students and TAs, with 9 papers accepted at the MLRC...
  • DriveGAN: Towards a Controllable High-Quality Neural Simulation
    Realistic simulators are critical for training and verifying robotics systems. While most of the contemporary simulators are hand-crafted, a scalable way to build simulators is to use machine learning to learn how the environment behaves in response to an action, directly from data. In this work, we aim to learn to simulate a dynamic environment directly in pixel-space, by watching unannotated sequences of frames and their associated action pairs. We introduce a novel high-quality neural simulator referred to as DriveGAN that achieves controllability by disentangling different components without supervision...
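
For readers curious about the L-systems mentioned in the generative art item above, here is a minimal Python sketch of the core idea: a string is rewritten repeatedly using production rules. It is not the article's implementation. The function name `expand` and the `ALGAE_RULES` table are illustrative assumptions, and only deterministic rules are covered (the stochastic and parametric rules the article describes are left out).

```python
def expand(axiom, rules, iterations):
    """Rewrite every symbol in parallel, once per iteration.

    Symbols without a production rule are copied through unchanged.
    """
    current = axiom
    for _ in range(iterations):
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Lindenmayer's classic "algae" system, used here purely as an illustration:
# A -> AB, B -> A
ALGAE_RULES = {"A": "AB", "B": "A"}

if __name__ == "__main__":
    for n in range(5):
        print(n, expand("A", ALGAE_RULES, n))
```

In a drawing application, each symbol of the expanded string would then be interpreted as a turtle-graphics command (move forward, turn, push or pop position), which is where the tree-like shapes come from.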



Similarity Search: An Introduction from Pinecone

Similarity search (or "vector search") is a new method of searching through big data. Unlike traditional search methods, it indexes and searches through vector representations of data. It uses a combination of deep learning models and state-of-the-art algorithms to find items by their conceptual meanings rather than keywords or properties.

The ability to search for similar items, and not just exact matches, makes many tasks as easy as an API call:
  • Show recommended products to customers
  • Show recommended content to users
  • Personalize search results
  • Deduplicate documents
  • Match records
  • Search by image, audio, or video
  • Detect anomalies
  • Question-answering
  • And much more...
Learn more about similarity search, then deploy your own similarity search application with a few lines of code using Pinecone.
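
To make the idea of searching over vector representations concrete, here is a small brute-force sketch in plain Python. It is not Pinecone's API. The item names, the three-dimensional vectors, and the helper names `cosine_similarity` and `top_k` are all hypothetical, and a real system would get its vectors from a trained embedding model and use an approximate index rather than scanning every item.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k items whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy "embeddings"; real ones would come from a model.
INDEX = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "espresso machine": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(top_k([1.0, 0.0, 0.0], INDEX, k=2))
```

The query matches the two footwear items even though it shares no keywords with them, because the comparison happens between vectors rather than strings.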

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Experimental Behavioral Scientist - BetterUp, Inc. - US-based, remote

    BetterUp is a mobile-based coaching platform that brings personalized professional coaching to employees at all levels. We help managers lead better, teams perform better, and employees thrive personally and inspire professionally.

    We are seeking an experimental behavioral scientist to join our team. In this role, you will direct a portfolio of original research to answer an essential question: What makes people happy and flourishing at work?

    You’ll draw on your experience as an experimental social scientist, statistician, and lover of all things Data, to uncover groundbreaking findings at an epicenter of human experience: life at work. Your work will inform BetterUp products, inspire our customers, inform the broader scientific community, and amplify BetterUp’s reputation as a global thought-leader.

        Want to post a job here? Email us for details >>


Training & Resources

  • Introduction to Reinforcement Learning
    In this tutorial, we aim to provide readers with a high-level overview of the fundamentals of RL as well as example code in Python, introducing the OpenAI Gym library. We begin by building intuitions about what is considered an RL problem, and we introduce formal definitions as well as key terminology used to describe and model an RL application. In parallel, we will focus on solving a concrete example of an RL problem (CartPole) using a classic RL algorithm called Q-learning. The Q-learning fundamentals presented in this tutorial were key to teaching neural networks to play Atari games, as DeepMind demonstrated in 2013. At the end of the tutorial, we present references for recommended further reading...
  • Paper Explained - Why AI is Harder Than We Think (Full Video Analysis) [Video]
    The AI community has gone through regular cycles of AI Springs, where rapid progress gave rise to massive overconfidence, high funding, and overpromising, followed by these promises going unfulfilled and a slide into periods of disillusionment and underfunding, called AI Winters. This paper examines the reasons for the repeated periods of overconfidence and identifies four fallacies that people make when they see rapid progress in AI...
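
As a companion to the reinforcement learning tutorial above, here is a minimal sketch of tabular Q-learning in plain Python. It is not the tutorial's code: instead of CartPole and OpenAI Gym it uses a made-up five-cell corridor so that it runs with the standard library alone, and the environment, hyperparameters, and function names (`step`, `train`, `greedy_policy`) are all assumptions chosen for illustration.

```python
import random

N_STATES = 5       # corridor cells 0..4; cell 4 is the terminal goal
ACTIONS = (0, 1)   # 0 = step left, 1 = step right


def step(state, action):
    """Toy deterministic environment: reward 1.0 only when the goal is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_policy(q):
    """Best action in each non-terminal state according to the learned table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The single update line inside the loop is the whole algorithm; applying the same update to an environment with a continuous state such as CartPole additionally requires discretizing the observations or replacing the table with a function approximator.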




Books

  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian