Data Science Weekly Newsletter

Issue 431

February 24, 2022

Editor's Picks

A Gentle Introduction to Vector Databases
In this blog post, I’ll introduce concepts related to the vector database, a new type of technology designed to store, manage, and search embedding vectors. Vector databases are being used in an increasingly large number of applications, including but not limited to image search, recommender systems, text understanding, video summarization, drug discovery, and stock market analysis...
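
As a rough illustration of the core operation a vector database performs, the sketch below runs a brute-force nearest-neighbour search using cosine similarity. The item names, the three-dimensional vectors, and the `search` helper are all hypothetical; production systems typically rely on approximate indexes rather than scanning every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(vec, query), item_id)
              for item_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "stock_chart": [0.0, 0.2, 0.9],
}

results = search(index, query=[1.0, 0.2, 0.0], k=2)
print(results)  # ['cat_photo', 'dog_photo']
```

The exhaustive scan costs one similarity computation per stored vector, which is the cost that dedicated vector indexes are designed to reduce.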

(Hopefully almost) everything you need to know about data science interviews (EU perspective) [Reddit Discussion]
So I’ve recently dived into the job search again. I hadn’t really interviewed much in more than 3 years and, well, the market has changed a lot. I have a total of 5 YoE + STEM PhD, which means this experience is probably not generalisable, but I hope these insights will be helpful for some...

I no longer believe that an MS in Statistics is an appropriate route for becoming a Data Scientist [Reddit Discussion]
When I was working as a data scientist (with a BS), I believed somewhat strongly that Statistics was the proper field for training to become a data scientist--not computer science, not data science, not analytics. Statistics...However, now that I'm doing a statistics MS, my perspective has completely flipped...

A Message From This Week's Sponsor

Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.

Data Science Articles & Videos

One Voice Detector to Rule Them All
In this article we will tell you about Voice Activity Detection in general, describe our approach to VAD metrics, and show how to use our VAD and test it on your own voice...

Tools and Recommendations for Reproducible Teaching
It is recommended that teacher-scholars of data science adopt reproducible workflows in their research as scholars and teach reproducible workflows to their students. In this paper, we propose a third dimension to reproducibility practices and recommend that regardless of whether they teach reproducibility in their courses or not, data science instructors adopt reproducible workflows for their own teaching. We consider computational reproducibility, documentation, and openness as the three pillars of a reproducible teaching framework. We share tools, examples, and recommendations for the three pillars...

Beyond Precision: Expressiveness in Visualization
In recent years, I have grown increasingly dissatisfied with the way we teach and talk about data visualization – at least from what I observe in academic settings. In particular, I am concerned with the predominant paradigm that visualization can and should be designed according to how precisely a given visual encoding can represent data. The story we tell ourselves (and the same story I tell with increasing discomfort to my students) goes a little like this...

An introduction to the deceit of statistical significance without p-values
A recent Twitter quiz asked “what is a powerful concept from your field that, if more people understood it, their lives would be better?” Unambiguously, the answer from my field is statistical significance...Here, I’ll explain in as plain terms as I can what statistical significance means in almost every published scientific study. I’ll do this without ever defining a p-value, as p-values have nothing to do with the way significance testing is used. Instead, significance testing amounts to hand wavy arguments about precision and variability. Laying it out this way shows why the authority granted to significance testing is so suspect and unearned...

Transfer Learning on Greyscale Images: How to Fine-Tune Pretrained Models on Black-and-White Datasets
In this article, we shall attempt to demystify all of the considerations needed when finetuning with black-and-white images by exploring the difference between RGB and greyscale images, and how these formats affect the processing operations done by convolutional neural network models, before demonstrating how to use greyscale images with pretrained models. We shall finish by examining the performance of the different approaches explored on some open source datasets and compare this to training from scratch on greyscale images...

Graph Theory and Linear Algebra
Graphs are an incredibly versatile structure insofar as they can model everything from the modernity of computer science and complexity of geography, to the intricacy of linguistic relationships and the universality of chemical structures...This paper explores the relationships between graph theory, their associated matrix representations, and the matrix properties found in linear algebra...In order to achieve this goal, this paper presents some of the most interesting theorems regarding matrix representations of graphs, and ties these theorems back to questions in graph theory itself...

An Introduction to Neural Data Compression
Neural compression is the application of neural networks and other machine learning methods to data compression. While machine learning deals with many concepts closely related to compression, entering the field of neural compression can be difficult due to its reliance on information theory, perceptual metrics, and other knowledge specific to the field. This introduction hopes to fill in the necessary background by reviewing basic coding topics such as entropy coding and rate-distortion theory, related machine learning ideas such as bits-back coding and perceptual metrics, and providing a guide through the representative works in the literature so far...

What are the Most Important Statistical Ideas of the Past 50 Years?
We review the most important statistical ideas of the past half century, which we categorize as: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, Bayesian multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. We discuss key contributions in these subfields, how they relate to modern computing and big data, and how they might be developed and extended in future decades. The goal of this article is to provoke thought and discussion regarding the larger themes of research in statistics and data science...

DeepMind - The Podcast, Episode: Me, myself and AI
AI doesn’t just exist in the lab, it’s already solving a range of problems in the real world. In this episode, Hannah encounters a realistic recreation of her voice by WaveNet, the voice synthesising system that powers the Google Assistant and helps people with speech difficulties and illnesses regain their voices. Hannah also discovers how ‘deepfake’ technology can be used to improve weather forecasting and how DeepMind researchers are collaborating with Liverpool Football Club, aiming to take sports to the next level...

Dive into Deep Learning Compilers
This project is for readers who are interested in high-performance implementation of their programs utilizing deep learning techniques...In the first part, we will introduce how to implement and optimize operators, such as matrix multiplication and convolution, for various hardware platforms...In the second part, we will show how to convert neural network models from various deep learning frameworks and further optimize them at the program level. In the last part, we will address how to deploy the optimized programs into various environments, such as mobile phones...In addition, at the end of the book, we plan to cover some of the latest advances in the deep learning compiler domain...

Things that upset you as a Data scientist [Reddit Discussion]
I have been a data scientist for seven years. There are several challenges we face every day, and to this day, something that absolutely upsets me is not having a single good IDE for prototyping and production development. I constantly find myself switching between JupyterLab and VS Code and it's really annoying!...Anyway, I just want to hear what other big pain points you face as a data scientist in your everyday work that absolutely upset you!...

Forum

Check out the new Anaconda Community for all-things data! Want insights into the newest developments in the world of data, or need help getting “unstuck” on a problem? Our Community Forums is the place to go! Be the first to engage with other professionals and ask questions to the broader data community. Users can join in conversations around trends, debate new features, post questions to the community, and more. Plus, it’s another avenue for technical help! Create your free Anaconda Community account now.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

(Senior) Analytics Engineer - Fabulous - Remote
Fabulous is a mobile app helping thousands of people every day to change their lifestyles by integrating healthy habits into their lives. Fabulous is using a behavioral economics lens to help everyone achieve their fullest potential. We work closely with researchers based at Duke University and our advisor is Dan Ariely, author of NYT bestseller Predictably Irrational. We are looking for an experienced Analytics Engineer to consolidate the Data Science team and lead the development and enrichment of our Data Pipelines. We have a modern Data-Stack based on Fivetran, dbt, BigQuery, Amplitude, Metabase...

Training & Resources

Linear & Polynomial Regression: Exploring Some Red Flags For Models That Underfit
The purpose of this project is to observe some of the red flags for a model that is severely underfitting the data and how these red flags change when fitting a more appropriate model...The red flags that I’ll be considering are: a) MSE and R-squared – these are common performance metrics used in linear models, b) Residual plot – this plot will show us if some of the assumptions of linear regression have been violated, and c) Learning curves – this plot will show us how well the model fits the data and usually gives a good indication of over/underfitting...
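
To make the MSE and R-squared red flag concrete, here is a minimal sketch on toy data that is not taken from the project. A straight line fitted to points drawn from y = x^2 yields an R-squared of zero, while regressing on a squared feature reproduces the data. The helper names `fit_line` and `mse_and_r2` are assumptions made for illustration, and the "polynomial" model is simplified to a single quadratic feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mse_and_r2(ys, preds):
    """Mean squared error and coefficient of determination."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return ss_res / n, 1 - ss_res / ss_tot


# Toy data with a purely quadratic relationship (assumed for illustration).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Underfitting model: a straight line in x.
a, b = fit_line(xs, ys)
linear_preds = [a + b * x for x in xs]
linear_mse, linear_r2 = mse_and_r2(ys, linear_preds)

# More appropriate model: a line in the squared feature x^2.
squared = [x ** 2 for x in xs]
a2, b2 = fit_line(squared, ys)
poly_preds = [a2 + b2 * s for s in squared]
poly_mse, poly_r2 = mse_and_r2(ys, poly_preds)

print(linear_mse, linear_r2)  # 12.0 0.0
print(poly_mse, poly_r2)      # 0.0 1.0
```

The residuals of the straight-line fit (`y - prediction`) are positive at both ends and negative in the middle, which is the curved pattern a residual plot would reveal.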

Massive Tutorial on Image Processing And Preparation For Deep Learning in Python, #2
Manipulate and transform images at will...

Preparing for Google's Machine Learning Interview [YouTube Video]
In this video, I share how I prepared for Google's machine learning software engineer interview and the resources I found helpful...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
