Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
Issue 420
December 9, 2021

Editor's Picks

  • D3 and Data Visualization Insights with Mike Bostock
    What’s the secret for D3’s long-time success? Mike Bostock, the creator of D3, shares the reasons for his data visualization tool’s longevity, and why it won the 10-year Test-of-Time award from the IEEE. Mike goes deep on D3 and Observable, which he also founded, and talks about all things visualization with The Data Wranglers Joe Hellerstein and Jeffrey Heer, including when it’s OK to use a bar chart for getting quick data insights and the applications of time zone wrangling...
  • A Call to Build Models Like We Build Open-Source Software
    This post argues that we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research....
  • AI-DR Program Automated Decision-Making and the Law Clearinghouse Project
    One public perception is that automated decision-making is fairer, or could even be more lawful. This perception stems from the belief that human bias may be eliminated in automated decisions. However, as emerging research has shown, unlawful discrimination can flow from the bias that remains encoded in automated decision-making systems...The aim of this clearinghouse project thus is to highlight seminal and impactful articles focused on issues of AI Decision-Making and the law. The AI-DR Program is pleased to share a searchable database of legal scholarly articles related to AI, automated decision-making and the law...



A Message From This Week's Sponsor

Retool is the fast way to build an interface for any database. With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.


Data Science Articles & Videos

  • Learning with not Enough Data Part 1: Semi-Supervised Learning
    The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when the labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data in conjunction with a small amount of labeled data...
  • Automated Story Generation as Question-Answering
    We propose a novel approach to automated story generation that treats the problem as one of generative question-answering. Our proposed story generation system starts with sentences encapsulating the final event of the story. The system then iteratively (1) analyzes the text describing the most recent event, (2) generates a question about "why" a character is doing the thing they are doing in the event, and then (3) attempts to generate another, preceding event that answers this question...
  • Cloud Wars: The Attack of Snowflakes
    Erik Bern wrote a post last week, combining the counterintuitive ideas that (a) the lowest cloud infrastructure layers are not commodity services, and (b) this means that the cloud providers could be happy ceding ground to others for higher level services, turning into pure play infrastructure platforms...I’m in violent agreement with the first premise that the lowest cloud infra layers are not commodity services. But I think it’s unlikely that cloud providers would be happy ceding ground to others on higher level services...
  • Visualize Data on Spirals
    In this vignette, I describe the package spiralize which visualizes data along an Archimedean spiral. It has two major advantages for visualization: a) It is able to visualize data with very long axis with high resolution and b) It is efficient for time series data to reveal periodic patterns...
  • Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval
    Today we [DeepMind] are releasing three papers on language models that reflect this interdisciplinary approach. They include a detailed study of a 280 billion parameter transformer language model called Gopher, a study of ethical and social risks associated with large language models, and a paper investigating a new architecture with better training efficiency...
  • Updated spaCy NLP Course
    We've updated our interactive NLP course for spaCy v3!... The updated course is available in English, Spanish, German and Japanese... 4 interactive chapters: from the first steps to your own spaCy model... New exercises about the training CLI & config...
  • A Cartel of Influential Datasets Is Dominating Machine Learning Research, New Study Suggests
    A new paper from the University of California and Google Research has found that a small number of ‘benchmark’ machine learning datasets, largely from influential western institutions, and frequently from government organizations, are increasingly dominating the AI research sector...the authors contend that ‘widely-used datasets are introduced by only a handful of elite institutions’, and that this ‘consolidation’ has increased to 80% in recent years...
  • PyTorch: Where we are headed and why it looks a lot like Julia (but not exactly like Julia)
    When trying to predict how PyTorch would itself get disrupted, we used to joke a bit about the next version of PyTorch being written in Julia. This was not very serious: a huge factor in moving PyTorch from Lua to Python was to tap into Python’s immense ecosystem (an ecosystem that shows no signs of going away) and even today it is still hard to imagine how a new language can overcome the network effects of Python...However, recently, I have been thinking about various projects we have going on in PyTorch...
  • minitorch
    MiniTorch is a DIY teaching library for machine learning engineers who wish to learn about the internal concepts underlying deep learning systems. It is a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental. The final library can run Torch code. The project was developed for the course 'Machine Learning Engineering' at Cornell Tech...
  • Building a recommendation engine inside Postgres with Python and Pandas
    Earlier today I was starting to wonder why couldn't I do more machine learning directly inside the Postgres database. Yeah, there is madlib, but what if I wanted to write my own recommendation engine? So I set out on a total detour of a few hours and lo and behold, I can probably do a lot more of this in Postgres than I realized before. What follows is a quick walkthrough of getting a recommendation engine set up directly inside Postgres on top of Crunchy Bridge, our database as a service...



Tools


What's a vector database, and how can you use it for AI/ML applications? Vector databases help data scientists and ML engineers implement NLP into search, personalization, security, analytics, and monitoring applications. Learn all about them, their use cases, their core components, and how to get started. (It's easy.) Start here: What is a vector database?

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!




Training & Resources

  • Intuitive Bayes Introductory Course
    Have you found most statistics books overly theoretical? Math-heavy? Or lacking a clear focus on application?...Want to keep your skills sharp to improve your career prospects?...Have you heard about these newfangled Probabilistic Programming Languages and want to know what they're all about?...Then this course is for you...
  • How a Kalman filter works, in pictures
    You can use a Kalman filter in any place where you have uncertain information about some dynamic system, and you can make an educated guess about what the system is going to do next. Even if messy reality comes along and interferes with the clean motion you guessed about, the Kalman filter will often do a very good job of figuring out what actually happened. And it can take advantage of correlations between crazy phenomena that you maybe wouldn’t have thought to exploit!...I’ll start with a loose example of the kind of thing a Kalman filter can solve, but if you want to get right to the shiny pictures and math, feel free to jump ahead...
  • Reddit Discussion: Why are Einstein Sum Notations not popular in ML? They changed my life.
    I recently discovered `torch.einsum` and now I am mad at every friend, mentor, acquaintance for not telling me about it...They are just way more intuitive and can handle most operations that I would want to do with tensors so elegantly...It takes only 30 mins or so to learn the notation and become somewhat proficient but then you are sorted for life...What are the arguments for and against using einstein notations for everything?...
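
For readers curious about the Kalman filter item in the list above, the predict-then-update idea it describes can be sketched in a few lines of plain Python. This is a minimal one-dimensional illustration, not code from the linked article: the function name `kalman_1d`, its parameters, and the sample readings are all hypothetical, and it assumes the hidden value is roughly constant and observed through noisy measurements.

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a roughly constant hidden value.

    Hypothetical helper for illustration only.
    x is the current estimate, p is the variance (uncertainty) of that estimate.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the model says the value stays put, but uncertainty grows a little.
        p = p + process_var
        # Update: the Kalman gain k weighs the measurement against the prediction.
        k = p / (p + measurement_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append((x, p))
    return estimates


# Made-up noisy readings of a value near 5.
readings = [5.1, 4.9, 5.3, 5.0, 4.8]
for x, p in kalman_1d(readings, process_var=1e-4, measurement_var=0.25):
    print(f"estimate={x:.3f} variance={p:.4f}")
```

The gain `k` is the whole trick: when the measurement noise is large relative to the current uncertainty, the filter trusts its prediction; when it is small, the filter trusts the new reading. The full filter in the article generalizes this to vectors and covariance matrices.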


P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
