Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
Issue 431
February 24, 2022

Editor's Picks

  • A Gentle Introduction to Vector Databases
    In this blog post, I’ll introduce concepts related to the vector database, a new type of technology designed to store, manage, and search embedding vectors. Vector databases are being used in an increasingly large number of applications, including image search, recommender systems, text understanding, video summarization, drug discovery, and stock market analysis...
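
To make the idea of searching embedding vectors concrete, here is a minimal sketch of similarity search, assuming cosine similarity as the metric and a brute-force scan over a tiny in-memory index. The names `cosine_similarity`, `search`, and `index`, and the three toy vectors, are hypothetical and not taken from the linked post; production vector databases typically rely on approximate nearest-neighbor indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k=2):
    """Return the keys of the k stored vectors most similar to the query.

    This is a brute-force scan: every stored vector is compared
    against the query, which only scales to small collections.
    """
    scored = [(cosine_similarity(vec, query), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy 3-dimensional "embeddings" (made up for illustration).
index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

print(search(index, [1.0, 0.0, 0.0], k=2))
```

The query vector points along the first axis, so the two animal vectors rank ahead of "car", whose weight is mostly on the third axis.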



A Message From This Week's Sponsor



Retool is the fast way to build an interface for any database. With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.


Data Science Articles & Videos

  • One Voice Detector to Rule Them All
    In this article we will tell you about Voice Activity Detection in general, describe our approach to VAD metrics, and show how to use our VAD and test it on your own voice...
  • Tools and Recommendations for Reproducible Teaching
    It is recommended that teacher-scholars of data science adopt reproducible workflows in their research as scholars and teach reproducible workflows to their students. In this paper, we propose a third dimension to reproducibility practices and recommend that regardless of whether they teach reproducibility in their courses or not, data science instructors adopt reproducible workflows for their own teaching. We consider computational reproducibility, documentation, and openness as the three pillars of a reproducible teaching framework. We share tools, examples, and recommendations for the three pillars...
  • Beyond Precision: Expressiveness in Visualization
    In recent years, I have grown increasingly dissatisfied with the way we teach and talk about data visualization – at least from what I observe in academic settings. In particular, I am concerned with the predominant paradigm that visualization can and should be designed according to how precisely a given visual encoding can represent data. The story we tell ourselves (and the same story I tell with increasing discomfort to my students) goes a little like this...
  • An introduction to the deceit of statistical significance without p-values
    A recent Twitter quiz asked “what is a powerful concept from your field that, if more people understood it, their lives would be better?” Unambiguously, the answer from my field is statistical significance...Here, I’ll explain in as plain terms as I can what statistical significance means in almost every published scientific study. I’ll do this without ever defining a p-value, as p-values have nothing to do with the way significance testing is used. Instead, significance testing amounts to hand wavy arguments about precision and variability. Laying it out this way shows why the authority granted to significance testing is so suspect and unearned...
  • Transfer Learning on Greyscale Images: How to Fine-Tune Pretrained Models on Black-and-White Datasets
    In this article, we shall attempt to demystify all of the considerations needed when finetuning with black-and-white images by exploring the difference between RGB and greyscale images, and how these formats affect the processing operations done by convolutional neural network models, before demonstrating how to use greyscale images with pretrained models. We shall finish by examining the performance of the different approaches explored on some open source datasets and compare this to training from scratch on greyscale images...
  • Graph Theory and Linear Algebra
    Graphs are an incredibly versatile structure insofar as they can model everything from the modernity of computer science and complexity of geography, to the intricacy of linguistic relationships and the universality of chemical structures...This paper explores the relationships between graph theory, their associated matrix representations, and the matrix properties found in linear algebra...In order to achieve this goal, this paper presents some of the most interesting theorems regarding matrix representations of graphs, and ties these theorems back to questions in graph theory itself...
  • An Introduction to Neural Data Compression
    Neural compression is the application of neural networks and other machine learning methods to data compression. While machine learning deals with many concepts closely related to compression, entering the field of neural compression can be difficult due to its reliance on information theory, perceptual metrics, and other knowledge specific to the field. This introduction hopes to fill in the necessary background by reviewing basic coding topics such as entropy coding and rate-distortion theory, related machine learning ideas such as bits-back coding and perceptual metrics, and providing a guide through the representative works in the literature so far...
  • What are the Most Important Statistical Ideas of the Past 50 Years?
    We review the most important statistical ideas of the past half century, which we categorize as: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, Bayesian multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. We discuss key contributions in these subfields, how they relate to modern computing and big data, and how they might be developed and extended in future decades. The goal of this article is to provoke thought and discussion regarding the larger themes of research in statistics and data science...
  • DeepMind - The Podcast, Episode: Me, myself and AI
    AI doesn’t just exist in the lab, it’s already solving a range of problems in the real world. In this episode, Hannah encounters a realistic recreation of her voice by WaveNet, the voice synthesising system that powers the Google Assistant and helps people with speech difficulties and illnesses regain their voices. Hannah also discovers how ‘deepfake’ technology can be used to improve weather forecasting and how DeepMind researchers are collaborating with Liverpool Football Club, aiming to take sports to the next level...
  • Dive into Deep Learning Compilers
    This project is for readers who are interested in high-performance implementation of their programs utilizing deep learning techniques...In the first part, we will introduce how to implement and optimize operators, such as matrix multiplication and convolution, for various hardware platforms...In the second part, we will show how to convert neural network models from various deep learning frameworks and further optimize them at the program level. In the last part, we will address how to deploy the optimized program into various environments such as mobile phones...In addition, at the end of the book, we plan to cover some of the latest advances in the deep learning compiler domain...
  • Things that upset you as a Data scientist [Reddit Discussion]
    I have been a data scientist for seven years. There are several challenges we face every day, and to this day, something that absolutely upsets me is not having a single good IDE for prototyping and production development. I constantly find myself switching between JupyterLab and VS Code and it's really annoying!...Anyway, I just want to hear what other big pain points you face as a data scientist in your everyday work that absolutely upset you!...



Forum

Check out the new Anaconda Community for all things data! Want insights into the newest developments in the world of data, or need help getting “unstuck” on a problem? Our Community Forums are the place to go! Be the first to engage with other professionals and ask questions to the broader data community. Users can join in conversations around trends, debate new features, post questions to the community, and more. Plus, it’s another avenue for technical help! Create your free Anaconda Community account now.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Training & Resources

  • Linear & Polynomial Regression: Exploring Some Red Flags For Models That Underfit
    The purpose of this project is to observe some of the red flags for a model that is severely underfitting to the data and how these red flags change when fitting a more appropriate model...The red flags that I’ll be considering are: a) MSE and R-squared – these are common performance metrics used in linear models, b) Residual plot – this plot will show us if some of the assumptions of linear regression have been violated, and c) Learning curves – this plot will show us how well the model fits to the data and usually gives a good indication of over/under fitting...
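
Two of the red flags listed above, the performance metrics and the residual pattern, can be sketched on a toy dataset. This is a minimal illustration under stated assumptions, not the linked project's code: the data are made up as an exact parabola (y = x²), and the helper names `fit_line` and `evaluate` are hypothetical. The "more appropriate model" here is the same least-squares line fitted on the squared feature instead of the raw one. Learning curves are not covered.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(xs, ys, slope, intercept):
    """Return (MSE, R-squared, residuals) for a fitted line."""
    preds = [slope * x + intercept for x in xs]
    residuals = [y - p for y, p in zip(ys, preds)]
    ss_res = sum(r * r for r in residuals)
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return ss_res / len(ys), 1 - ss_res / ss_tot, residuals


# Toy data: an exact parabola, so a straight line in x must underfit.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

# Underfitting model: straight line in x.
lin_mse, lin_r2, lin_res = evaluate(xs, ys, *fit_line(xs, ys))

# More appropriate model: straight line in the squared feature.
xs_sq = [x ** 2 for x in xs]
quad_mse, quad_r2, quad_res = evaluate(xs_sq, ys, *fit_line(xs_sq, ys))

print("linear    MSE, R^2:", lin_mse, lin_r2)
print("quadratic MSE, R^2:", quad_mse, quad_r2)
print("linear residuals:  ", lin_res)
```

On this data the straight line has a high MSE and an R-squared of zero, and its residuals are positive at both ends and negative in the middle. That curved pattern in the residuals is the kind of signal a residual plot would reveal; it disappears once the squared feature is used.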


P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian