Data Science Weekly Newsletter

Issue 420

December 9, 2021

Editor's Picks

D3 and Data Visualization Insights with Mike Bostock
What’s the secret for D3’s long-time success? Mike Bostock, the creator of D3, shares the reasons for his data visualization tool’s longevity, and why it won the 10-year Test-of-Time award from the IEEE. Mike goes deep on D3 and Observable, which he also founded, and talks about all things visualization with The Data Wranglers Joe Hellerstein and Jeffrey Heer, including when it’s OK to use a bar chart for getting quick data insights and the applications of time zone wrangling...

A Call to Build Models Like We Build Open-Source Software
This post argues that we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research....

AI-DR Program Automated Decision-Making and the Law Clearinghouse Project
One public perception is that automated decision-making is fairer, or could even be more lawful. This perception stems from the belief that human bias may be eliminated in automated decisions. However, as emerging research has shown, unlawful discrimination can flow from the bias that remains encoded in automated decision-making systems...The aim of this clearinghouse project thus is to highlight seminal and impactful articles focused on issues of AI Decision-Making and the law. The AI-DR Program is pleased to share a searchable database of legal scholarly articles related to AI, automated decision-making and the law...

A Message From This Week's Sponsor

Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.

Data Science Articles & Videos

Learning with not Enough Data Part 1: Semi-Supervised Learning
The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when the labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data in conjunction with a small amount of labeled data...

Automated Story Generation as Question-Answering
We propose a novel approach to automated story generation that treats the problem as one of generative question-answering. Our proposed story generation system starts with sentences encapsulating the final event of the story. The system then iteratively (1) analyzes the text describing the most recent event, (2) generates a question about "why" a character is doing the thing they are doing in the event, and then (3) attempts to generate another, preceding event that answers this question...

Cloud Wars: The Attack of Snowflakes
Erik Bernhardsson wrote a post last week, combining the counterintuitive ideas that (a) the lowest cloud infrastructure layers are not commodity services, and (b) this means that the cloud providers could be happy ceding ground to others for higher level services, turning into pure play infrastructure platforms... I’m in violent agreement with the first premise that the lowest cloud infra layers are not commodity services. But I think it’s unlikely that cloud providers would be happy ceding ground to others on higher level services...

Visualize Data on Spirals
In this vignette, I describe the package spiralize, which visualizes data along an Archimedean spiral. It has two major advantages for visualization: a) it is able to visualize data with a very long axis at high resolution and b) it is efficient for time series data to reveal periodic patterns...
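
To make the idea concrete, here is a minimal Python sketch of how a time series can be laid along an Archimedean spiral. It is not the spiralize API (spiralize is an R package); the function name `spiral_xy` and the parameters `period` and `spacing` are hypothetical, chosen only for illustration. One loop of the spiral is assumed to cover one period.

```python
import math


def spiral_xy(t, period, spacing=1.0):
    """Map time t to (x, y) on the Archimedean spiral r = b * theta.

    One full loop covers `period` time units, and the radius grows by
    `spacing` per loop. Both parameters are illustrative assumptions.
    """
    theta = 2.0 * math.pi * t / period
    b = spacing / (2.0 * math.pi)
    r = b * theta
    return r * math.cos(theta), r * math.sin(theta)


# Daily observations over three weeks, one loop per week (period = 7 days).
points = [spiral_xy(day, period=7) for day in range(21)]
```

Because days 0, 7, and 14 share the same angle, values from the same weekday line up along one ray from the center, which is how a periodic pattern becomes visible on such a layout.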

Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval
Today we [DeepMind] are releasing three papers on language models that reflect this interdisciplinary approach. They include a detailed study of a 280 billion parameter transformer language model called Gopher, a study of ethical and social risks associated with large language models, and a paper investigating a new architecture with better training efficiency...

Updated spaCy NLP Course
We've updated our interactive NLP course for spaCy v3!... The updated course is available in English, Spanish, German and Japanese... 4 interactive chapters: from the first steps to your own spaCy model... New exercises about the training CLI & config...

A Cartel of Influential Datasets Is Dominating Machine Learning Research, New Study Suggests
A new paper from the University of California and Google Research has found that a small number of ‘benchmark’ machine learning datasets, largely from influential western institutions, and frequently from government organizations, are increasingly dominating the AI research sector...the authors contend that ‘widely-used datasets are introduced by only a handful of elite institutions’, and that this ‘consolidation’ has increased to 80% in recent years...

PyTorch: Where we are headed and why it looks a lot like Julia (but not exactly like Julia)
When trying to predict how PyTorch would itself get disrupted, we used to joke a bit about the next version of PyTorch being written in Julia. This was not very serious: a huge factor in moving PyTorch from Lua to Python was to tap into Python’s immense ecosystem (an ecosystem that shows no signs of going away) and even today it is still hard to imagine how a new language can overcome the network effects of Python...However, recently, I have been thinking about various projects we have going on in PyTorch...

minitorch
MiniTorch is a DIY teaching library for machine learning engineers who wish to learn about the internal concepts underlying deep learning systems. It is a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental. The final library can run Torch code. The project was developed for the course 'Machine Learning Engineering' at Cornell Tech...

Building a recommendation engine inside Postgres with Python and Pandas
Earlier today I was starting to wonder why I couldn't do more machine learning directly inside the Postgres database. Yeah, there is madlib, but what if I wanted to write my own recommendation engine? So I set out on a total detour of a few hours and lo and behold, I can probably do a lot more of this in Postgres than I realized before. What follows is a quick walkthrough of getting a recommendation engine set up directly inside Postgres on top of Crunchy Bridge, our database as a service...

Tools

What's a vector database, and how can you use it for AI/ML applications?
Vector databases help data scientists and ML engineers build NLP into search, personalization, security, analytics, and monitoring applications. Learn all about them, their use cases, their core components, and how to get started. (It's easy.) Start here: What is a vector database?

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

R&D Data Scientist - Danaher - Port Washington, NY
As a Data Scientist at IBM, you will help transform our clients’ data into tangible business value by analyzing information, communicating outcomes and collaborating on product development. Work with best-in-class open source and visual tools, along with the most flexible and scalable deployment options. Whether it’s investigating patient trends or weather patterns, you will work to solve real-world problems for the industries transforming how we live.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Intuitive Bayes Introductory Course
Have you found most statistics books overly theoretical? Math-heavy? Or lacking a clear focus on application?...Want to keep your skills sharp to improve your career prospects?...Have you heard about these newfangled Probabilistic Programming Languages and want to know what they're all about?...Then this course is for you...

How a Kalman filter works, in pictures
You can use a Kalman filter in any place where you have uncertain information about some dynamic system, and you can make an educated guess about what the system is going to do next. Even if messy reality comes along and interferes with the clean motion you guessed about, the Kalman filter will often do a very good job of figuring out what actually happened. And it can take advantage of correlations between crazy phenomena that you maybe wouldn’t have thought to exploit!...I’ll start with a loose example of the kind of thing a Kalman filter can solve, but if you want to get right to the shiny pictures and math, feel free to jump ahead...
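
As a rough illustration of the predict-then-correct cycle described above, here is a minimal scalar Kalman filter in Python. It is a sketch under simplifying assumptions, not the article's code: the state is a single value assumed to stay roughly constant, and the function name `kalman_1d` and the noise parameters `process_var` and `measurement_var` are hypothetical. The article itself works with multi-dimensional states and correlations.

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a value assumed to be roughly constant.

    Returns the list of state estimates, one per measurement.
    """
    x, p = x0, p0  # current state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the model says the value stays put, so only uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + measurement_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Noisy readings of a quantity that is close to 5.
readings = [5.1, 4.9, 5.2, 5.0, 4.8]
estimates = kalman_1d(readings, process_var=1e-4, measurement_var=0.1)
```

The gain `k` is close to 1 while the filter is still uncertain, so early measurements move the estimate a lot. It shrinks as the variance `p` drops, so later noisy readings are smoothed rather than followed.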

Reddit Discussion: Why are Einstein Sum Notations not popular in ML? They changed my life.
I recently discovered `torch.einsum` and now I am mad at every friend, mentor, acquaintance for not telling me about it...They are just way more intuitive and can handle most operations that I would want to do with tensors so elegantly...It takes only 30 mins or so to learn the notation and become somewhat proficient but then you are sorted for life...What are the arguments for and against using einstein notations for everything?...
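
For readers who have not seen the notation, here is a small sketch of a few common operations written as einsum expressions. It uses `numpy.einsum` rather than `torch.einsum` so that it runs without PyTorch; the subscript strings shown here are accepted by both. The arrays are made-up illustration data.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # shape (2, 3)
B = np.arange(12).reshape(3, 4)   # shape (3, 4)

# Matrix multiplication: the shared index j is summed over.
matmul = np.einsum("ij,jk->ik", A, B)

# Transpose: reorder the output indices.
transposed = np.einsum("ij->ji", A)

# Row sums: an index missing from the output is summed over.
row_sums = np.einsum("ij->i", A)

# Batched matrix product over a leading batch index b.
X = np.ones((5, 2, 3))
Y = np.ones((5, 3, 4))
batched = np.einsum("bij,bjk->bik", X, Y)   # shape (5, 2, 4)
```

Each index letter names an axis. Letters repeated across inputs are multiplied together, and letters absent from the output are summed away.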

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian
