Data Science Weekly Newsletter

Issue 418

November 25, 2021

Editor's Picks

The Fry Universe
You probably like some types of fries more than others...3D modeling of various fry shapes reveals why you like some more than others...

Neural-Control Family: What Deep Learning + Control Enables in the Real World
With the unprecedented advances of modern machine learning comes the tantalizing possibility of smart data-driven autonomous systems across a broad range of real-world settings. However, is machine learning (especially deep learning) really ready to be deployed in safety-critical systems?...

Transformers from Scratch
I procrastinated a deep dive into transformers for a few years. Finally the discomfort of not knowing what makes them tick grew too great for me. Here is that dive...Transformers were introduced in...2017 paper as a tool for sequence transduction: converting one sequence of symbols to another. The most popular example of this is translation, as in English to German. It has also been modified to perform sequence completion: given a starting prompt, carry on in the same vein and style. They have quickly become an indispensable tool for research and product development in natural language processing....

A Message From This Week's Sponsor

Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.

Data Science Articles & Videos

Data Advantage Matrix: A New Way to Think About Data Strategy
As the co-founder of two data start-ups, one question I get all the time is, “How do I get started with my data strategy? Where do we start? What do we prioritize?”... In NewVantage Partners’ annual survey, the percentage of companies that invest in data initiatives was near-universal (literally 99% in 2021) for the third year in a row...But while investing in data is a given, actually using data can feel like a crapshoot. In that same survey, only 24% of companies said that they had created a data-driven culture....In this article, I’ll break down how to think about your data strategy...and give examples of how two hypothetical companies would use it...

BookNLP
BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including: Part-of-speech tagging, Dependency parsing, Entity recognition, Character name clustering and coreference resolution, Quotation speaker identification, Supersense tagging, Event tagging, Referential gender inference, and more...

The missing analytics executive
Despite the apparent discrepancies in title (CTO sounds higher than VP) and responsibilities (leading a department sounds more important than tinkering), the two roles are peers. Both are senior executives, and both often report to the CEO. The division of labor is a recognition not of hierarchy, but that there’s enough important labor in engineering that it needs to be divided: One role to manage, and one to advise...Data departments should follow the same pattern. Rather than being led by a single ambiguously defined and overburdened CDO, data teams should have two representatives in senior management: A VP of data responsible for managing the team’s daily operations, and a chief analytics officer...

Getting into the subspace; or what happens when you approximate a Gaussian process
The problem with Gaussian processes, at least from a computational point of view, is that they’re just too damn complicated. Because they are supported on some infinite dimensional Banach space B, the more we need to see of them (for instance because we have a lot of unique s_i's) the more computational power they require. So the obvious solution is to somehow make Gaussian processes less complex...This somehow has occupied a lot of people’s time over the last 20 years and there are many many many many possible options. But for the moment, I just want to focus on one of the generic classes of solutions: You can make Gaussian processes less computationally taxing by making them less expressive...

Bernoulli's Fallacy & the Crisis of Modern Science, with Aubrey Clayton
I love epistemology, the study of how we know what we know...So...it was high time I dedicated a whole episode to this topic. And what better guest than Aubrey Clayton, the author of the book Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science...Aubrey is a mathematician in Boston who teaches the philosophy of probability and statistics at the Harvard Extension School and he holds a PhD in mathematics from Berkeley...We talked about what he deems “a catastrophic error in the logic of the standard statistical methods in almost all the sciences” and why this error manifests even outside of science, like in medicine, law, public policy, etc...But don’t worry, we’re not doomed: we’ll also see where we go from there. As a big fan of E.T. Jaynes, Aubrey will also tell us how this US scientist influenced his own thinking as well as the field of Bayesian inference in general....

Reddit Discussion: Anyone regret coming to this field?
If yes, which path would you have taken?...

A Survey of Generalisation in Deep Reinforcement Learning
The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We provide a unifying formalism and terminology for discussing different generalisation problems, building upon previous works...

Paper Digest: NeurIPS 2021 Highlights
One sentence highlight for every NeurIPS-2021 Paper, plus code for 200 of them...

Machine Learning Street Talk Podcast Episode #53: Quantum Natural Language Processing - Prof. Bob Coecke (Oxford)
Bob Coecke is a celebrated physicist who has been a Physics and Quantum professor at Oxford University for the last 20 years. He is particularly interested in Structure, which is to say, Logic, Order, and Category Theory. He is well known for work involving compositional distributional models of natural language meaning, and he is also fascinated with understanding how our brains work...Bob thinks that interactions between systems in Quantum Mechanics carry naturally over to how word meanings interact in natural language. Bob argues that this interaction embodies the phenomenon of quantum teleportation...

Get Ready For Confidential Computing
In this post we describe the ecosystem of tools focused on protecting data while in use. Our primary focus is on Confidential Computing tools for the development of data, analytic, and AI applications. We believe that companies that are able to use data securely will be well-positioned to build data and AI applications in the future...

Tools

High quality data labeling, consistently
Edge cases are the most common challenges that ML teams face when training their AI models, making it difficult to reach 95+% accuracy. This can be more complex once you need to scale and start working with 3rd party data labeling solutions. The evaluation metrics that we use to measure the quality of labeled data, Intersection over Union (IOU) and F1 score, have allowed us to make swift adjustments on the go and continuously improve the quality of our labeling standards. To find out more and start exploring our end-to-end data labeling service, speak to the team at Supahands today.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

R&D Data Scientist - Danaher - Port Washington, NY
As a Data Scientist at IBM, you will help transform our clients’ data into tangible business value by analyzing information, communicating outcomes and collaborating on product development. Work with Best in Class open source and visual tools, along with the most flexible and scalable deployment options. Whether it’s investigating patient trends or weather patterns, you will work to solve real world problems for the industries transforming how we live.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Stanford CS223A - Introduction to Robotics
The purpose of this course is to introduce you to basics of modeling, design, planning, and control of robot systems. In essence, the material treated in this course is a brief survey of relevant results from geometry, kinematics, statics, dynamics, and control...

JAX Global Meetup
JAX Global is an online meetup group that hosts live events from researchers and engineers on topics related to the JAX library, machine learning and scientific computing. Join us and learn more about this wonderful framework and interact with awesome people!...

ApplyingML - Papers, Guides, and Interviews with ML practitioners
ApplyingML collects tacit/tribal/ghost knowledge on applying ML via curated papers/blogs, guides, and interviews with ML practitioners. In a nutshell, it's 1/3 applied-ml, 1/3 ghost knowledge, and 1/3 Tim Ferriss Show. The intent is to make it easier to apply—and benefit from—ML at work...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
