Data Science Weekly Newsletter

Issue 401

July 29, 2021

Editor's Picks

Generally capable agents emerge from open-ended play
Today, we published "Open-Ended Learning Leads to Generally Capable Agents," a preprint detailing our first steps to train an agent capable of playing many different games without needing human interaction data...The result is an agent with the ability to succeed at a wide spectrum of tasks — from simple object-finding problems to complex games like hide and seek and capture the flag, which were not encountered during training. We find the agent exhibits general, heuristic behaviours such as experimentation, behaviours that are widely applicable to many tasks rather than specialised to an individual task. This new approach marks an important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments...

Tour of the Sacred Library
Art + AI project in which the author presents a series of short paragraphs and uses CLIP + VQGAN to synthesize images that support the narrative (a walk through a library)...

What time-weighted averages are and why you should care
Many people who work with time-series data have nice, regularly sampled datasets. Data could be sampled every few seconds, or milliseconds, or whatever they choose, but by regularly sampled, we mean the time between data points is basically constant. Computing the average value of data points over a specified time period in a regular dataset is a relatively well-understood query to compose. But for those who don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data...
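
To make the idea concrete, here is a minimal Python sketch of a time-weighted average. It is not the article's own query or implementation: it assumes linear interpolation between neighbouring points (a trapezoid rule), and the function name and the sample readings are made up for illustration.

```python
def time_weighted_average(points):
    """Time-weighted average of (timestamp, value) pairs sorted by timestamp.

    Assumption: the value changes linearly between neighbouring points,
    so each interval contributes the area of a trapezoid.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to span a time interval")
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        area += (t1 - t0) * (v0 + v1) / 2.0
    total_time = points[-1][0] - points[0][0]
    return area / total_time


# Hypothetical irregular readings: four samples in the first 3 seconds,
# then nothing until second 60.
readings = [(0, 10.0), (1, 10.0), (2, 10.0), (3, 10.0), (60, 20.0)]

naive_average = sum(value for _, value in readings) / len(readings)
weighted_average = time_weighted_average(readings)

print(f"naive average:         {naive_average:.2f}")
print(f"time-weighted average: {weighted_average:.2f}")
```

The naive mean is dominated by the burst of early samples, while the time-weighted version gives each value influence in proportion to how long it was in effect. Carrying the last observation forward instead of interpolating is another reasonable choice and would give a different number.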

A Message From This Week's Sponsor

The Vector Database
Pinecone is a fully managed vector database that makes it easy to add vector similarity search to production applications. It combines state-of-the-art vector search libraries, advanced features such as live index updates, and distributed infrastructure to provide high performance and reliability at any scale. No more hassles of benchmarking and tuning algorithms or building and maintaining infrastructure for vector search. Advanced ML teams use vector search to drastically improve results for semantic text search, image/audio search, recommendation systems, feed ranking, abuse/fraud detection, deduplication, and other applications.

3 reasons to try Pinecone:

It's production-ready: Go to production with a few lines of code, without breaking a sweat or slowing down.
It's scalable and high-performing: Search through billions of vectors in tens of milliseconds.
It's fully managed: We obsess over operations and security so you don't have to.

Try Pinecone now for free →
PS: Get a free t-shirt after you run your first query!

Data Science Articles & Videos

Triton: Open-Source GPU Programming for Neural Networks
We’re releasing Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code—most of the time on par with what an expert would be able to produce. Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS—something that many GPU programmers can’t do—in under 25 lines of code. Our researchers have already used it to produce kernels that are up to 2x more efficient than equivalent Torch implementations, and we’re excited to work with the community to make GPU programming more accessible to everyone...

What is the right level of specialization? For data teams and anyone else
This isn't as much of a blog post as an elaboration of a tweet I posted the other day..."I think this specialization of data teams into 99 different roles (data scientist, data engineer, analytics engineer, ML engineer etc) is generally a bad thing driven by the fact that tools are bad and too hard to use"...This seems to have resonated with a lot of people, but for whatever reason, it ended up being a lot more polarizing than I thought! There was a fair amount of misunderstanding of what I meant, so I just wanted to expand this into a slightly longer argument...

Building intuition for p-values and statistical significance
This is the transcript of a talk I did on experimentation and A/B testing, to give the audience an intuitive understanding of p-values and statistical significance...The format of the talk was a short introduction, followed by typing in and showing an IPython notebook. I typed out the initial, short snippets to make it interesting; for the later parts I switched to an existing notebook to keep the talk's momentum going...I open by saying that in experimentation, Data Scientists have 3 jobs: a) Design, b) Run, c) Evaluate... experiments. This talk is about the evaluation phase, and can be summed up as "don’t get fooled by randomness"...
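
One way to build that intuition is to simulate A/A tests, where both arms share the same true conversion rate, and count how often a standard test still reports a "significant" difference. The sketch below is not taken from the talk's notebook: the pooled two-proportion z-test, the function names, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # identical all-or-nothing outcomes: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(runs, users_per_arm, true_rate, alpha, rng):
    """Fraction of A/A experiments (no real effect) declared significant."""
    false_positives = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < true_rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(users_per_arm))
        p_value = two_proportion_p_value(conv_a, users_per_arm, conv_b, users_per_arm)
        if p_value < alpha:
            false_positives += 1
    return false_positives / runs


rng = random.Random(0)  # fixed seed so the run is repeatable
false_positive_rate = simulate_aa_tests(
    runs=1000, users_per_arm=500, true_rate=0.1, alpha=0.05, rng=rng
)
print(f"A/A tests flagged as significant: {false_positive_rate:.1%}")
```

Because there is no real effect in any of these experiments, every "significant" result is randomness alone, and the flagged fraction should land near the chosen alpha of 5%.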

Mapping Africa’s Buildings with Satellite Imagery
Because satellite imaging involves photographing the earth from several hundred kilometres above the ground, even at high resolution (30–50 cm per pixel), a small building or tent shelter occupies only a few pixels. The task is even more difficult for informal settlements, or rural areas where buildings constructed with natural materials can visually blend into the surroundings...In “Continental-Scale Building Detection from High-Resolution Satellite Imagery”, we address these challenges, using new methods for detecting buildings that work in rural and urban settings across different terrains, such as savannah, desert, and forest, as well as informal settlements and refugee facilities. We use this building detection model to create the Open Buildings dataset, a new open-access data resource containing the locations and footprints of 516 million buildings with coverage across most of the African continent...

Understanding the World Through Action [Video]
Extended version of a talk on self-supervised reinforcement learning prepared for the ICML Self-Supervised Learning Workshop, 2021...

Solving Mixed Integer Programs Using Neural Networks
Mixed Integer Programming (MIP) solvers rely on an array of sophisticated heuristics developed with decades of research to solve large-scale MIP instances encountered in practice. Machine learning offers to automatically construct better heuristics from data by exploiting shared structure among instances in the data. This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one...

Visualizing Autoencoders with Tensorflow.js
An autoencoder is a type of neural network that is comprised of two functions: an encoder that projects data from high to low dimensionality, and a decoder that projects data from low to high dimensionality. To understand how these two functions work, let’s consider the following images...

Incremental Development of PyMC Models
PyMC is a powerful tool for doing Bayesian statistics, but getting started can be intimidating. This article presents an example that I think is a good starting place, and demonstrates a method I use to develop and test models incrementally...

What Have Language Models Learned?
Large language models are making it possible for computers to write stories, program a website and turn captions into images. One of the first of these models, BERT, is trained by taking sentences, splitting them into individual words, randomly hiding some of them, and predicting what the hidden words are. After doing this millions of times, BERT has “read” enough Shakespeare to predict how phrases usually end...This page is hooked up to a version of BERT trained on Wikipedia and books. Try clicking on different words to see how they’d be filled in or typing in another sentence to see what else BERT has picked up on...
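
The data-preparation step described above (split into words, hide some at random, keep the hidden words as prediction targets) can be sketched in a few lines of Python. This is a simplified illustration rather than BERT's actual pipeline, which works on subword tokens and uses a more involved replacement scheme. The function name, the sample sentence, and the 15% masking rate used here are assumptions for the example.

```python
import random

MASK = "[MASK]"


def mask_words(sentence, mask_prob=0.15, rng=None):
    """Hide a random subset of words; return the masked words and the targets.

    targets maps each hidden position to the original word, which is what
    a model would be trained to predict.
    """
    rng = rng or random.Random()
    words = sentence.split()
    masked, targets = [], {}
    for position, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = word
        else:
            masked.append(word)
    return masked, targets


sentence = "to be or not to be that is the question"
masked, targets = mask_words(sentence, mask_prob=0.3, rng=random.Random(1))
print(" ".join(masked))
print(targets)
```

Training then amounts to showing the model the masked sequence and scoring its guesses against `targets`.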

A Bayesian Puzzle: The Left-Handed Sister Problem
Suppose you meet someone who looks like the brother of your friend Mary. You ask if he has a sister named Mary, and he says “Yes I do, but I don’t think I know you.”...You remember that Mary has a sister who is left-handed, but you don’t remember her name. So you ask your new friend if he has another sister who is left-handed...If he does, how much evidence does that provide that he is the brother of your friend, rather than a random person who coincidentally has a sister named Mary and another sister who is left-handed? In other words, what is the Bayes factor of the left-handed sister?...
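
For readers who want to see the shape of the calculation, here is a rough Python sketch of a Bayes factor for this puzzle. It is not the author's solution. It assumes that a true brother of Mary certainly has the left-handed sister you remember, and that for a stranger the number of siblings besides himself and his sister Mary follows a made-up distribution. The 50% chance that a sibling is a girl and the 10% chance of being left-handed are also illustrative assumptions.

```python
def prob_left_handed_sister(extra_sibling_dist, p_girl=0.5, p_left=0.1):
    """P(at least one left-handed sister among the extra siblings).

    extra_sibling_dist maps k (siblings besides the man and Mary) to its
    probability. Each sibling is independently a left-handed girl with
    probability p_girl * p_left.
    """
    p_lh_sister = p_girl * p_left
    return sum(
        p_k * (1 - (1 - p_lh_sister) ** k)
        for k, p_k in extra_sibling_dist.items()
    )


# Hypothetical distribution of extra siblings; NOT real demographic data.
extra_sibling_dist = {0: 0.40, 1: 0.35, 2: 0.15, 3: 0.10}

likelihood_if_brother = 1.0  # assumption: your memory of Mary's sister is correct
likelihood_if_stranger = prob_left_handed_sister(extra_sibling_dist)
bayes_factor = likelihood_if_brother / likelihood_if_stranger

print(f"P(left-handed sister | stranger) = {likelihood_if_stranger:.4f}")
print(f"Bayes factor = {bayes_factor:.1f}")
```

The answer is only as good as the family-size assumptions; the point of the sketch is that the Bayes factor is the ratio of how likely the observation is under each hypothesis.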

Tools

Retool is the fastest way to build internal tools. As developers, we realized that all internal tools are made up of the same building blocks: tables, drop-downs, buttons, text inputs, etc. So, we built a drag-and-drop interface that makes it super easy to build internal tool UIs. All with prebuilt database connectors, and the ability to customize every aspect of code with JavaScript. Companies like DoorDash, Amazon and Brex use Retool to build internal tools super fast. Don't waste hours searching for React components and wrangling data sources and APIs! Try Retool instead.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Senior Data Scientist - WarnerMedia - New York, NY

WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

Reporting to the Sr. Manager, Data Science, this role will help develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

How to Convert Static Pandas Plot (Matplotlib) to Interactive?
Pandas provides plotting functionality, but all of its plots are static...What if you want your plots to be interactive?...We have designed this tutorial to show how to keep using the existing pandas plotting interface to produce interactive graphs. We'll introduce a library called hvplot, which provides a wrapper around pandas so that it can use an interactive plotting library called holoviews. We'll walk through a few examples of how to use hvplot to generate interactive graphs...

PhD Thesis: Deep Learning For Medical Image Interpretation [PDF]
In this thesis, I describe three key directions that present challenges and opportunities for the development of deep learning technologies for medical image interpretation. First, I discuss the development of algorithms for expert-level medical image interpretation...Second, I discuss the design and curation of high-quality datasets and their roles in advancing algorithmic developments...Third, I discuss the real-world evaluation of medical image algorithms with studies systematically analyzing performance under clinically relevant distribution shifts...

PhD Thesis: Learned Feedback & Feedforward Perception & Control
The notions of feedback and feedforward information processing gained prominence under cybernetics...Negative feedback processing corrects errors, whereas feedforward processing makes predictions, thereby preemptively reducing errors...This thesis draws on feedback and feedforward ideas developed within predictive coding, adapting them to improve machine learning techniques for perception (Part II) and control (Part III). Upon establishing these conceptual connections, in Part IV, we traverse this bridge, from machine learning back to neuroscience, arriving at new perspectives on the correspondences between these fields...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian