Data Science Weekly Newsletter
August 20, 2020
Data Project Checklist
from Jeremy Howard of Fast.AI
There’s a lot more to creating useful data projects than just training an accurate model! When I used to do consulting, I’d always seek to understand an organization’s context for developing data projects, based on these considerations: Strategy, Data, Analytics, Implementation, Maintenance, and Constraints...I developed a questionnaire that I had clients fill out before a project started, and then throughout the project I’d help them refine their answers. This questionnaire is based on decades of projects across many industries, including agriculture, mining, banking, brewing, telecoms, retail, and more. Here I am sharing it publicly for the first time...
A Very Unlikely Chess Game
Last month, I asked Gwern Branwen if he thought GPT-2 could play chess. [Editor Note: GPT-2 is a language model that can write essays to a prompt, answer questions, and summarize longer works] I wondered if Gwern could train it on a corpus of chess games written in standard notation (where, for example, e2e4 means “move the pawn at square e2 to square e4”). There are literally millions of games written up like this. GPT-2 would learn to predict the next string of text, which would correspond to the next move in the chess game. Then you would prompt it with a chessboard up to a certain point, and it would predict how the chess masters who had produced its training data would continue the game – i.e., make its next move using the same heuristics they would...
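[Editor Note: To make the notation concrete, here is a minimal Python sketch of how a move string such as e2e4 splits into a source and a destination square, and how a game so far could be flattened into a text prompt. The function names `parse_move` and `build_prompt` are hypothetical, the space-separated prompt format is an assumption rather than the format used in the article, and five-character promotion moves are not handled.]

```python
from typing import List, Tuple

FILES = "abcdefgh"
RANKS = "12345678"


def parse_move(move: str) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Split a move like 'e2e4' into 0-based (file, rank) pairs."""
    if len(move) != 4:
        raise ValueError(f"expected 4 characters, got {move!r}")

    def square(name: str) -> Tuple[int, int]:
        file_char, rank_char = name[0], name[1]
        if file_char not in FILES or rank_char not in RANKS:
            raise ValueError(f"not a board square: {name!r}")
        return FILES.index(file_char), RANKS.index(rank_char)

    return square(move[:2]), square(move[2:])


def build_prompt(moves: List[str]) -> str:
    """Join the moves played so far into one text prompt (assumed format)."""
    for move in moves:
        parse_move(move)  # validate each move before using it
    return " ".join(moves)


source, target = parse_move("e2e4")
prompt = build_prompt(["e2e4", "e7e5", "g1f3"])
```

A language model trained on such strings would be asked to continue `prompt`, and the first four-character token it produces would be read as the next move.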
A bird’s-eye view of modern AI from NeurIPS 2019
This year, I had a chance to attend NeurIPS, the most prominent conference in artificial intelligence and machine learning (AI/ML), to present a workshop paper...Here, I’ve collected some of my impressions in the hopes that they might be useful to others...The most overarching theme I noticed at NeurIPS was the maturation of deep learning as a set of techniques...I’ll break down my impressions into three general areas: making models more robust and generalizable for the real world, making models more efficient, and interesting and emerging applications...
Vettery is an online hiring marketplace that's changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today.
Data Science Articles & Videos
Over the last ten years, many companies have created human-in-the-loop services that combine a mix of humans and algorithms. Now that some time has passed, we can tease out some patterns from their collective successes and failures. As someone who started a company in this space, my hope is that this retrospective can help prospective founders, investors, or companies navigating this space save time and fund more impactful projects...
My Experience as a Freelance Data Scientist in 2015
Every so often, data scientists who are thinking about going off on their own will email me with questions about my year of freelancing (2015). In my most recent response, I was a little more detailed than usual, so I figured it'd make sense as a blog post too...If my response comes across as negative, that's certainly not the intention -- being straight-forward about my experience is...I learned a lot, it just wasn't for me. Working by yourself on short(ish)-term things can get old...
Doing Freelance Data Science Consulting in 2019
About 15 months ago, I left my full-time job as a machine learning team lead with the goal of doing independent / freelance data science consulting. Since then, I’ve gotten a lot of questions about what that means and entails...I hope this blog post answers some of those questions for anybody interested in becoming or hiring a data science consultant...
How AI is helping us discover materials faster than ever
Recently, researchers at Northwestern University used AI to figure out how to make new metal-glass hybrids 200 times faster than they would have doing experiments in the lab. Other scientists are building databases of thousands of compounds so that algorithms can predict which ones combine to form interesting new materials. Others yet are using AI to mine published papers for “recipes” to make these materials...Now, instead of using artisan’s knowledge, we can use databases and computations to quickly map out exactly what makes a material so much stronger or lighter — and that has the potential to revolutionize industry after industry...
Schema Evolution in Data Lakes
There are countless articles to be found online debating the pros and cons of data lakes and comparing them to data warehouses. One of the key takeaways from these articles is that data lakes offer a more flexible storage solution. Whereas a data warehouse will need rigid data modeling and definitions, a data lake can store different types and shapes of data. This leads to the often used terms of “schema-on-write” for data warehouses and “schema-on-read” for data lakes. In other words, upon writing data into a data warehouse, a schema for that data needs to be defined. In a data lake, the schema of the data can be inferred when it’s read, providing the aforementioned flexibility. However, this flexibility is a double-edged sword and there are important tradeoffs worth considering...
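[Editor Note: The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only, not code from the article: the schema `WAREHOUSE_SCHEMA`, the field names, and both function names are hypothetical.]

```python
import json

# Hypothetical fixed schema that a warehouse table would enforce.
WAREHOUSE_SCHEMA = {"user_id": int, "country": str}


def write_with_schema(table, record):
    """Schema-on-write: reject a record unless it matches the schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def read_with_inferred_schema(raw_lines):
    """Schema-on-read: parse raw JSON lines and infer field types afterwards."""
    records = [json.loads(line) for line in raw_lines]
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return records, schema


# The lake accepts anything at write time; drift only shows up when reading.
lake = [
    '{"user_id": 1, "country": "DE"}',
    '{"user_id": "2", "plan": "pro"}',
]
records, inferred = read_with_inferred_schema(lake)
```

Here `inferred["user_id"]` ends up holding two types, which is the kind of silent schema drift that the flexibility of schema-on-read makes possible.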
Challenges to the Reproducibility of Machine Learning Models in Health Care
Reproducibility has been an important and intensely debated topic in science and medicine for the past few decades...Against this backdrop, high-capacity machine learning models are beginning to demonstrate early successes in clinical applications, and some have received approval from the US Food and Drug Administration. This new class of clinical prediction tools presents unique challenges and obstacles to reproducibility, which must be carefully considered to ensure that these techniques are valid and deployed safely and effectively...
Wave physics as an analog recurrent neural network
In this work, we identify a mapping between the dynamics of wave-based physical phenomena, such as acoustics and optics, and the computation in a recurrent neural network (RNN)...We show that wave-based physical systems can be trained to operate as an RNN and, as a result, can passively process signals and information in their native domain, without analog-to-digital conversion, which should result in a substantial gain in speed and a reduction in power consumption...
A Modern Introduction to Online Learning
In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms...
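[Editor Note: As a rough illustration of regret minimization, here is a small Python sketch of projected online gradient descent, which corresponds to Online Mirror Descent with a squared Euclidean regularizer. It is not taken from the monograph: the one-dimensional squared loss, the interval domain, the `radius / sqrt(t)` step size, and all function names are assumptions chosen for brevity.]

```python
import math


def online_gradient_descent(targets, radius=1.0):
    """Play x_t in [-radius, radius]; return the cumulative loss (x_t - y_t)^2."""
    x = 0.0
    total_loss = 0.0
    for t, y in enumerate(targets, start=1):
        total_loss += (x - y) ** 2          # loss is revealed after playing x
        gradient = 2.0 * (x - y)
        step = radius / math.sqrt(t)        # illustrative decaying step size
        x = x - step * gradient
        x = max(-radius, min(radius, x))    # project back onto the domain
    return total_loss


def best_fixed_loss(targets):
    """Loss of the best single point in hindsight (the mean, for squared loss)."""
    u = sum(targets) / len(targets)
    return sum((u - y) ** 2 for y in targets)


def regret(targets):
    """Cumulative loss of the learner minus that of the best fixed point."""
    return online_gradient_descent(targets) - best_fixed_loss(targets)


targets = [0.5] * 200
average_regret = regret(targets) / len(targets)
```

The quantity of interest is how `regret` grows with the number of rounds: a good online algorithm keeps it sublinear, so the average regret per round shrinks as the sequence gets longer.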
Our curriculum is designed to get you hired. Classes are interactive and have a rigorous structure. You'll also apply your knowledge of research, data pipelines, and APIs to build a real-world project with a small team of students from other tracks.
Cost: $0 upfront + 17% of salary for two years. $30k USD maximum total payment.
Check us out here -> Lambda School Data Science
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
At Pear Therapeutics, we have the privilege of building the world’s first-ever class of prescription digital therapeutics. By nature of our therapeutics as digital applications, we have access to rich datasets and unique opportunities to drive clinical outcomes. We are currently hiring for 2 positions:
Data Scientist: As a Data Scientist, you will be responsible for shaping and delivering data-driven insights. We are looking for data scientists with a deep product sense, who have an innate curiosity, and are eager to dive into large, complex datasets and create actionable insights.
Data Scientist (Platform): We are looking for an engineering-skilled Data Scientist to provide a strong data foundation to build upon for reporting, analysis, and modeling. This candidate will sit at the intersection of data science and engineering, and work collaboratively to achieve highly impactful outcomes.
Want to post a job here? Email us for details >> email@example.com
Training & Resources
MW-GAN: Multi-Warping GAN for Caricature Generation with Multi-Style Geometric Exaggeration
Given an input face photo, the goal of caricature generation is to produce stylized, exaggerated caricatures that share the same identity as the photo. It requires simultaneous style transfer and shape exaggeration with rich diversity, while preserving the identity of the input. To address this challenging problem, we propose a novel framework called Multi-Warping GAN (MW-GAN), including a style network and a geometric network that are designed to conduct style transfer and geometric exaggeration respectively...
Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings
In this research project, we explore a new problem of synthesizing time lapse videos depicting the creation of paintings...We introduce a new video synthesis task: synthesizing time lapse videos depicting how a given painting might have been created. Artists paint using unique combinations of brushes, strokes, colors, and layers. There are often many possible ways to create a given painting. Our goal is to learn to capture this rich range of possibilities...
TensorFlow JS: Blazeface detector
Blazeface is a lightweight model that detects faces in images. Blazeface makes use of the Single Shot Detector architecture with a custom encoder. The model may serve as a first step for face-related computer vision applications, such as facial keypoint recognition...
Data Science in Production: Building Scalable Model Pipelines with Python
This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and scikit-learn, and will focus on scaling up prototype models to production...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.
P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
All the best,
Hannah & Sebastian