Data Science Weekly Newsletter

Issue 422

December 23, 2021

Editor's Picks

Spreadsheet Games: All Playable in Excel or Google Sheets
Everyone knows game designers love working with spreadsheets, but there aren't enough games that run *in* spreadsheets...But my students are helping set things right. Check out some of their amazing games, all playable in Excel or Google Sheets...

Ways I Use Testing as a Data Scientist
As a data scientist, I wear many different hats, which also made learning about testing difficult. There’s plenty of material on testing from a software development perspective, but if I’m doing an analysis and not developing software, I found many of those concepts difficult to translate and apply in my work...In that spirit, I thought I would write a blog post on the many ways I use testing in my work, in hopes that other data scientists will find it helpful when they’re trying to figure out what to test and how to test in the code they write...

To Understand Language is to Understand Generalization
Like the parable of the blind men and the elephant, computer scientists have come up with different abstract frameworks to describe what it would take to make our machines smarter...I’d like to throw in another take on the elephant: the aforementioned properties of generalization we seek can be understood as nothing more than the structure of human language. Before you think “ew, linguistics” and close this webpage, I promise that I’m not advocating for hard-coding formal grammars as inductive biases into our neural networks (see paragraph 1). To the contrary, I argue that considering generalization as being equivalent to language opens up exciting opportunities to scale up non-NLP models the way we have done for language...

A Message From This Week's Sponsor

High quality data labeling, consistently
Edge cases are the most common challenges that ML teams face when training their AI models, making it difficult to reach 95+% accuracy. This can be more complex once you need to scale and start working with 3rd party data labeling solutions. The evaluation metrics that we use to measure the quality of labeled data - Intersection over Union (IOU) and F1 score - have allowed us to make swift adjustments on the go and continuously improve the quality of our labeling standards. To find out more and start exploring our end-to-end data labeling service, speak to the team at Supahands today.

Data Science Articles & Videos

Weisfeiler and Leman go Machine Learning: The Story so far
In recent years, algorithms and neural architectures based on the Weisfeiler-Leman algorithm, a well-known heuristic for the graph isomorphism problem, emerged as a powerful tool for machine learning with graphs and relational data. Here, we give a comprehensive overview of the algorithm's use in a machine learning setting, focusing on the supervised regime...Moreover, we give an overview of current applications and future directions to stimulate further research...

The Second Egress: Building a Code Change
This website is a tool to make sense of the wicked problem of the second egress in Canada and prepare a building code change...The first section documents the history of the building code and two means of egress in Canada, situates the problem of the second egress within the imperative of missing middle densification and calls upon architects to challenge the legislative conditions of their work. The next section compares jurisdictions to better understand the Canadian code relative to its peers, followed by the proposed code change. The third section reimagines what could and should be built if it were legal, illustrating these architectural opportunities with a series of case studies in alternative circulation...

MCMC for big datasets -- faster sampling with JAX and the GPU
You’ll often hear people say that MCMC is too slow for big datasets. For the very biggest datasets with millions of observations, there may be some truth to that. But the developers of PyMC and Stan are constantly refining their samplers, and it’s now possible to fit models to much bigger datasets than you might think...But how much faster is MCMC with JAX, and with a GPU? This blog post explores this question on a single example. It’s limited, of course – maybe other models will see more or less of a gain – and, although I did my best to write code efficiently, things could probably be optimised further. Still, I hope you’ll agree that there are some interesting results...

What’s In Store for the Future of the Modern Data Stack?
A few weeks ago, I had the opportunity to chat with Bob Muglia, former CEO of Snowflake and one of the pioneers of the modern data stack, to learn about his predictions for the future of our industry...

The Mathematics of Linear Distortion
The mathematics of linear distortion only applies to linear and time invariant systems. Therefore, these systems and their translation to the frequency domain, where the mathematical analysis is simplified, are briefly summarized. Then it is discussed how the presented theory can be applied to real transmission media and/or electronic components. Finally, the mathematics of all possible cases of linear distortion are summarized in a table, and each case is explained individually...

Introducing Skippa - Scikit-learn Pre-processing Pipelines in Pandas
Skippa is a package designed to: a) drastically simplify development, b) package / serialize all data cleaning, pre-processing together with your model algorithm into a single pipeline file, c) reuse the interface/components from pandas & scikit-learn that you’re already familiar with, and more...Skippa helps you to easily define data cleaning & pre-processing operations on a pandas DataFrame and combine it with a scikit-learn model/algorithm into a single executable pipeline. It works roughly like this...

Programming as a Vehicle for Math
In March 2020, I gave a talk at Math for America, an organization that fosters professional development for K-12 math teachers in the New York City area. It was part of my *A Programmer's Introduction to Mathematics* “book tour,”...The MfA organizers never posted my talk online, and at this point I’ve lost hope that they will (thanks, Covid). So I’ll recap the content of the talk, linking to my slides (click there for nice images and gifs) and the transcript I prepared in advance of that talk. This post will summarize the main ideas and provide some extra color...

Algorithmic Trading Models - Machine Learning
I’ve written 4 articles on theoretical concepts behind algorithmic trading models. The previous articles have covered breakouts, moving averages, oscillators and cyclical methods. The 5th model type, machine learning methods, is considerably more involved due to the scope of the topic, so this article is definitely not designed to be a white paper on the only way ML can be used in algorithmic trading. My goal in this article is to provide one framework that incorporates some form of computer learning to predict future prices of the GBP/USD rate. You can consider this part 1 of Algorithmic Trading Models — Machine Learning, because the topic is too broad to cover in one article, and I will be writing more with alternate ideas in the future...

On Bayesian Geometry: Geometric interpretation of probability distributions
The idea behind Bayes Geometry is simple: what if we represent any function in the parameter space as a vector in a certain vector space? Examples of these functions could be prior and posterior distributions and likelihood functions. Then we can define an inner product on that space that will help us to calculate an angle between two distributions and interpret the angle as a measure of how different the distributions are from each other. In my discussion on this subject I will follow a paper by de Carvalho et al...
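
As a rough illustration of the idea in this blurb, here is a minimal Python sketch. It assumes the inner product is the plain integral of the product of two densities, approximated by a Riemann sum on a uniform grid; the article and the paper it follows may define things differently. The helper names (`normal_pdf`, `inner`, `angle_between`) and the example distributions are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, used here only as example input."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def inner(f, g, grid):
    """Riemann-sum approximation of the integral of f * g over a uniform grid."""
    step = grid[1] - grid[0]
    return sum(f(t) * g(t) for t in grid) * step

def angle_between(f, g, grid):
    """Angle (in radians) between two functions treated as vectors."""
    cos = inner(f, g, grid) / math.sqrt(inner(f, f, grid) * inner(g, g, grid))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.acos(max(-1.0, min(1.0, cos)))

# Uniform grid over the parameter space (an assumed range of [-10, 10]).
grid = [-10 + 0.01 * i for i in range(2001)]

prior = lambda t: normal_pdf(t, 0.0, 1.0)
posterior_close = lambda t: normal_pdf(t, 0.5, 1.0)  # barely moved from the prior
posterior_far = lambda t: normal_pdf(t, 4.0, 1.0)    # moved far from the prior

angle_same = angle_between(prior, prior, grid)
angle_close = angle_between(prior, posterior_close, grid)
angle_far = angle_between(prior, posterior_far, grid)
```

Under these assumptions, a distribution compared with itself gives an angle near zero, and the angle grows toward a right angle as the two distributions overlap less.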

How Should Organizations Structure their Data?
Since the rise of computing in the ’90s there have been heated debates over the best data structuring techniques. However, two have reigned supreme — the ideas of Bill Inmon and Ralph Kimball. Both define ETL pipelines that bring data from a variety of sources into the same location for access by stakeholders within the organization...However, in the early 2000s, Dan Linstedt invented another data pipeline structure called a data vault...In this post we will review a comparison from a 2021 paper that outlines each method and explains the pros and cons of each. Please note that each topic is complex, so we only cover the very basics — more resources are linked throughout the post and in the comments...

Tools

Free Course: Natural Language Processing (NLP) for Semantic Search
Learn how to build semantic search applications by making machines understand language as people do. This free course covers everything you need to build state-of-the-art language models, from machine translation to question-answering, and more. Brought to you by Pinecone. Start reading now.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist, Decisions - Lyft - New York, NY
Data Science is at the heart of Lyft’s products and decision-making. As a member of the Science team, you will work in a dynamic environment, where we embrace moving quickly to build the world’s best transportation. Data Scientists take on a variety of problems ranging from shaping critical business decisions to building algorithms that power our internal and external products. We’re looking for passionate, driven Data Scientists to take on some of the most interesting and impactful problems in ridesharing...

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Relationship between SVD and PCA. How to use SVD to perform PCA?
Principal component analysis (PCA) is usually explained via an eigen-decomposition of the covariance matrix. However, it can also be performed via singular value decomposition (SVD) of the data matrix 𝐗. How does it work? What is the connection between these two approaches? What is the relationship between SVD and PCA?...Or in other words, how to use SVD of the data matrix to perform dimensionality reduction?...
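
For readers who want a concrete picture of the question above, here is a minimal NumPy sketch of one common way to do PCA through the SVD of the centred data matrix. It is not taken from the linked thread; the function name `pca_via_svd` and the synthetic data are hypothetical, and the cross-check against the covariance eigenvalues is only there to show that the two routes agree.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Project the rows of X onto the top principal axes using SVD."""
    Xc = X - X.mean(axis=0)                       # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # principal axes, one per row
    scores = Xc @ components.T                    # same as U[:, :k] * s[:k]
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Synthetic data: 200 samples, 3 correlated features (illustrative only).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

scores, components, explained_variance = pca_via_svd(X, 2)

# Cross-check: eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
```

The squared singular values divided by n - 1 match the leading eigenvalues of the covariance matrix, which is the link between the two approaches.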

Top YouTube Channels that you must follow if you're into Data Science: A Twitter thread
A Twitter thread compiling Data Science YouTube channels...

Implementing Naive Bayes From Scratch
In the following sections, we will implement the Naive Bayes Classifier from scratch in a step-by-step fashion using just Python and NumPy...But, before we get started coding, let’s talk briefly about the theoretical background and assumptions underlying the Naive Bayes Classifier...
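
To give a flavour of what such an implementation can look like, here is a minimal sketch using just Python and NumPy. It assumes continuous features with a Gaussian likelihood per class, which may differ from the variant in the linked article; the class name `GaussianNaiveBayes` and the toy data are hypothetical.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with an independent Gaussian per feature and class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Class priors estimated from label frequencies.
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        # Per-class feature means and variances, shape (n_classes, n_features).
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small constant avoids division by zero for constant features.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict(self, X):
        # Differences to each class mean, shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.mean_[None, :, :]
        # Independence assumption: sum per-feature Gaussian log-densities.
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None] + diff ** 2 / self.var_[None]).sum(axis=2)
        return self.classes_[np.argmax(log_lik + self.log_prior_, axis=1)]

# Toy data: two well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNaiveBayes().fit(X, y)
accuracy = np.mean(model.predict(X) == y)
```

Working in log space keeps the product of many small probabilities from underflowing, which is the usual reason to sum log-likelihoods instead of multiplying densities.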

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
