Data Science Weekly Newsletter - Issue 429

Issue #396

June 23 2021

Editor Picks
  • 8 Lessons from 20 Years of Hype Cycles
    I've always been fascinated by how new technologies emerge and come to market. One of the major artifacts that tries to capture the state of our market and industry each year is the annual Gartner Hype Cycle - which I always read with interest. Just last month, I had an interesting thought: "Has anyone gone back and done a retrospective of Gartner Hype Cycles - because I'd totally read that article". A quick Google search didn't turn up anything useful, so I decided I'd do the work and write it myself. This article is the result...
  • On the “usefulness” of the Netflix Prize
    It has been over 10 years since the Netflix Prize finished, and I was not expecting to write a blog post about it at this point...Given that there seems to be continued interest as well as misunderstanding around the prize and its outcome, I thought it might be worthwhile to “set the record straight” in a dedicated post...TLDR; While I am often misquoted as having said that the Netflix Prize was not useful for Netflix, that is only true about the grand prize winning entry. Along the way, Netflix got far more than our money's worth for the famous prize...

A Message from this week's Sponsor:


Managed Faiss: Production-ready similarity search an API call away

Focus on model and application development, not algorithm tuning or infrastructure engineering.

Pinecone provides a fully managed, cloud-native service for vector similarity search in production. Run Faiss inside Pinecone for scalability, reliability, easy-to-use APIs, and more features compared to self-hosted Faiss.

Why use managed Faiss:
  • Production-Ready: Go to production without breaking a sweat or slowing down.
  • Scalable & Distributed: Scale to billions of vectors with all the bells and whistles.
  • Managed Infrastructure: We obsess over operations and security so you don't have to.
Get early access to Managed Faiss by Pinecone.

Data Science Articles & Videos

  • How to Do Multi-Task Learning Intelligently
    Traditionally, a single machine learning model is devoted to one task, e.g. classifying images, which is known as single-task learning (STL). There are some advantages, however, to training models to make multiple kinds of predictions on a single sample, e.g. image classification and semantic segmentation. This is known as Multi-task learning (MTL). In this article, we discuss the motivation for MTL as well as some use cases, difficulties, and recent algorithmic advances...
  • Where Are Pixels? -- a Deep Learning Perspective
    Due to some sloppy code in the early days of deep learning libraries, today we're facing multiple versions of resize functions. Together with the two different coordinate system conventions, they easily cause hidden bugs in computer vision code...This article revisits these historical technical debts and shows how these fun details matter in modeling and training. I hope they will help you make proper choices...
  • Learning Neural Network Subspaces
    Recent observations have advanced our understanding of the neural network optimization landscape, revealing the existence of (1) paths of high accuracy containing diverse solutions and (2) wider minima offering improved performance. Previous methods observing diverse paths require multiple training runs. In contrast we aim to leverage both property (1) and (2) with a single method and in a single training run. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks...
  • How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
    What I would like to share with you is an architectural perspective that underpins the failure of many data platform initiatives. I demonstrate how we can adapt and apply the learnings of the past decade in building distributed architectures at scale to the domain of data, and I will introduce a new enterprise data architecture that I call data mesh...
  • Homemade Machine Learning
    The purpose of this repository is not to implement machine learning algorithms by using 3rd party library one-liners but rather to practice implementing these algorithms from scratch and get a better understanding of the mathematics behind each algorithm. That's why all algorithm implementations are called "homemade" and are not intended to be used for production...Python examples of popular machine learning algorithms with interactive Jupyter demos and the math explained...
  • Class imbalance and classification metrics with aircraft wildlife strikes
    This is the latest in my series of screencasts demonstrating how to use the tidymodels packages, from just starting out to tuning more complex models with many hyperparameters. I recently participated in SLICED, a competitive data science prediction challenge. I did not necessarily cover myself in glory, but in today’s screencast, I walk through the data set on aircraft wildlife strikes we used and how different choices around handling class imbalance affect different classification metrics...
  • Dangers of Bayesian Model Averaging under Covariate Shift
    Approximate Bayesian inference for neural networks is considered a robust alternative to standard training, often providing good performance on out-of-distribution data. However, Bayesian neural networks (BNNs) with high-fidelity approximate inference via full-batch Hamiltonian Monte Carlo achieve poor generalization under covariate shift, even underperforming classical estimation. We explain this surprising result, showing how a Bayesian model average can in fact be problematic under covariate shift, particularly in cases where linear dependencies in the input features cause a lack of posterior contraction...
  • Deep reinforcement learning will transform manufacturing as we know it
    Deep reinforcement learning is strategic. It learns how to take a series of actions in order to reach a goal. That’s powerful and smart — and it’s going to change a lot of industries...Two industries on the cusp of AI transformations are manufacturing and supply chain. The ways we make and ship stuff are heavily dependent on groups of machines working together, and the efficiency and resiliency of those machines are the foundation of our economy and society. Without them, we can’t buy the basics we need to live and work...
  • Building data culture & embracing bad ideas with Hilary Mason
    Deep in Data Podcast Episode #4 with Hilary Mason...Hilary Mason is a seasoned practitioner: a former Chief Data Scientist at Bitly and co-founder of Fast Forward Labs, a machine intelligence research company that was acquired by Cloudera in 2017. Hilary was the General Manager of ML at Cloudera. Today, she holds a role as a Data Scientist in Residence at Accel Partners and is a co-founder at Hidden Door. Outside work, Hilary is a blogger, speaker, and author and dedicates her time to talent development as a Board member at the Anita Borg Institute for women in technology and at hackNY...
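
A quick illustration of the class imbalance item above: the sketch below is not taken from the screencast (which uses R and tidymodels). It is a minimal plain-Python example on made-up labels, showing why accuracy alone can mislead when one class is rare. The data, the 5% positive rate, and the function names are all assumptions for illustration.

```python
# Hypothetical toy labels: 1 = rare positive class, 0 = common negative class.
y_true = [1] * 5 + [0] * 95


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# A "model" that ignores its input and always predicts the majority class.
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))  # 0.95
print(recall(y_true, always_negative))    # 0.0
```

The always-negative predictor looks strong on accuracy yet finds none of the rare cases, which is why the choice of metric matters as much as the choice of resampling strategy.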



Sharpen your data skills by solving 3 questions per week – for free

Get data science interview questions frequently asked at top companies every Monday, Wednesday & Friday. Solve the problem before receiving the solution the next morning. Check your work and sharpen your skills! Join our free newsletter.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



  • Senior Data Scientist - WarnerMedia - New York, NY

    WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

    Reporting to the Sr. Manager, Data Science, this role will help to develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation.

        Want to post a job here? Email us for details >>


Training & Resources

  • Algorithms for Private Data Analysis [Text & Video]
    This course is on algorithms for differentially private analysis of data. As necessitated by the nature of differential privacy, this course will be theoretically and mathematically based. References to practice will be provided as relevant, especially towards the end of the course. Prerequisites include an undergraduate understanding of algorithms, comfort with probability, and mathematical maturity...
  • 16 New ML Gems for Ruby
    I set out to improve the machine learning ecosystem for Ruby and wasn’t sure where it would go. Over the next 5 months, I ended up releasing 16 libraries and learned a lot along the way. I wanted to share some of that knowledge and introduce some of the libraries you can now use in Ruby...
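
For readers new to the differential privacy course listed above, here is a minimal sketch of one standard building block, the Laplace mechanism applied to a counting query. It is not drawn from the course materials; the function names, the toy data, and the choice of epsilon are assumptions for illustration, and the floating-point sampling here is a teaching toy rather than something safe to deploy.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d.
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data and privacy budget.
rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 3; the printed value is 3 plus random noise
```

A smaller epsilon means a larger noise scale, so stronger privacy comes at the cost of a less accurate answer.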




  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian