Data Science Weekly Newsletter
August 12, 2021
How to avoid machine learning pitfalls: a guide for academic researchers
This document gives a concise outline of some of the common mistakes that occur when using machine learning techniques, and what can be done to avoid them. It is intended primarily as a guide for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results...
Machine Learning Won't Solve Natural Language Understanding
We argue that it is time to re-think our approach to NLU work, since we are convinced that the ‘big data’ approach to NLU is not only psychologically, cognitively, and even computationally implausible but, as we will show here, also theoretically and technically flawed...
A Message From This Week's Sponsor
The Vector Database
Pinecone is a fully managed vector database that makes it easy to add vector similarity search to production applications. It combines state-of-the-art vector search libraries, advanced features such as live index updates, and distributed infrastructure to provide high performance and reliability at any scale. No more hassles of benchmarking and tuning algorithms or building and maintaining infrastructure for vector search.
Advanced ML teams use vector search to drastically improve results for semantic text search, image/audio search, recommendation systems, feed ranking, abuse/fraud detection, deduplication, and other applications.
3 reasons to try Pinecone:
It's production-ready: Go to production with a few lines of code, without breaking a sweat or slowing down.
It's scalable and high-performing: Search through billions of vectors in tens of milliseconds.
It's fully managed: We obsess over operations and security so you don't have to.
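For readers new to the term, the vector similarity search mentioned in the sponsor message above can be illustrated with a brute-force scan. This is a minimal sketch in plain Python and is not Pinecone's API: the `index` dictionary, the `top_k` helper, and the three-dimensional vectors are all hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (item_id, score) pairs most similar to `query`.

    `index` is a hypothetical in-memory dict of item_id -> vector.
    Every vector is scanned, so the cost grows linearly with index size.
    """
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings", for illustration only.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
print(results)
```

A linear scan like this is exact but slows down as the collection grows, which is why large collections are usually served by approximate nearest-neighbor indexes instead.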
ART MACHINE: Put in text, get AI art.
This notebook is by Hillel Wayne. It's based on this notebook by Katherine Crowson, simplified to make it more accessible to nonprogrammers. The original technique was discovered by https://twitter.com/advadnoun....
A Dataset Exploration Case Study with Know Your Data
Today, we demonstrate some of the functionality of a dataset exploration tool, Know Your Data (KYD), recently introduced at Google I/O, using the COCO Captions dataset as a case study. Using this tool, we find a range of gender and age biases in COCO Captions — biases that can be traced to both dataset collection and annotation practices...
Mitigating dataset harms requires stewardship: Lessons from 1000 papers
Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical fixes in the dataset creation process. The premise of our work is that these efforts can be more effective if informed by an understanding of how datasets are used in practice in the research community...
What is the right level of specialization? For data teams and anyone else.
I think this specialization of data teams into 99 different roles (data scientist, data engineer, analytics engineer, ML engineer etc) is generally a bad thing driven by the fact that tools are bad and too hard to use. This seems to have resonated with a lot of people, but for whatever reason, it ended up being a lot more polarizing than I thought! There was a fair amount of misunderstanding of what I meant, so I just wanted to expand this into a slightly longer argument...
Announcing AI21 Studio and Jurassic-1 Language Models
We are thrilled to announce the launch of AI21 Studio, our new developer platform where you can use our state-of-the-art Jurassic-1 language models to build your own applications and services. Jurassic-1 models come in two sizes, where the Jumbo version, at 178B parameters, is the largest and most sophisticated language model ever released for general use by developers. AI21 Studio is currently in open beta, allowing anyone to sign up and immediately start querying Jurassic-1 using our API and interactive web environment...
Querying an SQL Database with SQL Alchemy
Though this article has nothing to do with actual alchemy, there are redeemable qualities in this transformative pursuit. We will use this as motivation when describing the power of SQL Alchemy and its ability to turn ‘common’ python code into ‘noble’ sql queries...
SDEdit: Image Synthesis and Editing with Stochastic Differential Equations
We introduce a new image editing and synthesis framework, Stochastic Differential Editing (SDEdit), based on a recent generative model using stochastic differential equations (SDEs). Given an input image with user edits (e.g., hand-drawn color strokes), we first add noise to the input according to an SDE, and subsequently denoise it by simulating the reverse SDE to gradually increase its likelihood under the prior...
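The two steps in the excerpt above (add noise, then simulate the reverse SDE) can be sketched on a one-dimensional toy problem. This is not the authors' implementation. It assumes a Gaussian prior whose score is known in closed form, standing in for a learned score network, and a variance-exploding noise schedule. The names `sdedit_1d`, `score`, `sigma_start`, and `sigma_min` are hypothetical.

```python
import math
import random

# Toy prior N(PRIOR_MEAN, PRIOR_STD**2); stands in for a learned image prior.
PRIOR_MEAN, PRIOR_STD = 0.0, 1.0


def score(x, sigma):
    """Score (gradient of the log-density) of the prior after adding
    Gaussian noise of scale sigma, i.e. N(PRIOR_MEAN, PRIOR_STD**2 + sigma**2)."""
    return -(x - PRIOR_MEAN) / (PRIOR_STD ** 2 + sigma ** 2)


def sdedit_1d(x_input, sigma_start=5.0, sigma_min=0.01, n_steps=500, rng=None):
    """Noise the input, then denoise it with Euler-Maruyama reverse-SDE steps."""
    if rng is None:
        rng = random.Random(0)
    # Geometric schedule from sigma_start down to sigma_min (an assumption).
    sigmas = [sigma_start * (sigma_min / sigma_start) ** (i / (n_steps - 1))
              for i in range(n_steps)]
    # Step 1: perturb the (edited) input with noise of scale sigma_start.
    x = x_input + sigma_start * rng.gauss(0.0, 1.0)
    # Step 2: walk the noise level back down, following the score.
    for sigma_hi, sigma_lo in zip(sigmas[:-1], sigmas[1:]):
        step = sigma_hi ** 2 - sigma_lo ** 2  # variance removed in this step
        x = x + step * score(x, sigma_hi) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


# An "unrealistic" input far from the prior is pulled back toward it.
edited = sdedit_1d(10.0)
print(edited)
```

In this toy, a larger `sigma_start` pulls the result closer to the prior and keeps less of the input, while a smaller one stays closer to the input.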
Face Mask Detection with Deep Learning and Computer Vision
Although some states have lifted mask requirements, masks are still mandated at indoor public places like airports and hospitals. But it’s difficult and inefficient to inspect large crowds with manual screening. So, the goal of this project is to build a face mask detection system using deep learning algorithms and computer vision to let machines help with inspection...
Open Audio and Video Datasets
During conversations with clients, we often get asked if there are any off-the-shelf audio and video datasets we would recommend, for testing and for them to use as a point of comparison with custom approaches... When we started searching for lists of datasets it was very surprising how limited they were... To address this, we have put together a list of 100+ open audio and video datasets. Each dataset listed below includes the number of recordings, the number of participants involved, the languages of the speech content, the file size, and the file type...
Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)
Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate
Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
We are looking for a Senior Data Analyst to help us re-develop our existing data workflow, enable better scalability, and improve accuracy. In addition to this, we’re looking for someone to help improve our ability to discover the relevant information in our data, driving our decisions in delivering an ever improving service.
The primary focus of the role will be in establishing a new data gathering pipeline, doing statistical analysis, and helping build the analytical basis for the prediction systems. This is the perfect opportunity to be intricately involved in running analytical experiments in a methodical manner, and give us a hand in improving the next generation of recommendation systems that power our social experience.
Want to post a job here? Email us for details >> firstname.lastname@example.org
Training & Resources
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
As part of the African Master’s in Machine Intelligence (AMMI 2021), we have delivered a course on Geometric Deep Learning (GDL100), which closely follows the contents of our GDL proto-book. We make all materials and artefacts from this course publicly available, as companion material for our proto-book, as well as a way to dive deeper into some of the contents for future iterations of the book....
Introduction to Deep Learning [170 Video lectures]
170 Video Lectures from Adaptive Linear Neurons to Zero-shot Classification with Transformers...I just sat down this morning and organized all deep learning related videos I recorded in 2021. I am sure this will be a useful reference for my future self, but I am also hoping it might be useful for one or the other person out there...PS: All code examples are in PyTorch :)...
Four Deep Learning Papers to Read in August 2021
Welcome to the August edition of the "Machine-Learning-Collage" series, where I provide an overview of the different Deep Learning research streams. So what is an ML collage? Simply put, I draft a one-slide visual summary of one of my favourite recent papers. Every single week. At the end of the month all of the resulting visual collages are collected in a summary blog post. Thereby, I hope to give you a visual and intuitive deep dive into some of the coolest trends. So without further ado: Here are my four favourite papers that I read in July 2021 and why I believe them to be important for the future of Deep Learning...