Data Science Weekly Newsletter

Issue 419

December 2, 2021

Editor's Picks

Flux <3 NumFOCUS
We are very excited to announce that FluxML is partnering with NumFOCUS as an affiliated project to further the cause of open and reproducible science and growing the adoption of the FluxML ecosystem. Flux has always had the mission of being a simple, hackable and performant approach to machine learning, which is extended to a number of domains in science by means of differentiable programming...This milestone is the result of the coming together of the Julia community to support the vision of producing high performance machine learning tools which are flexible towards the needs of novel use cases such as: graph neural networks, scientific machine learning, and differentiable programming...

30 days and as many maps
Writing this, looking back on the last 30 days, I realize how much of the fun I have writing these notebooks depends on the existence and vibrance of a large community and on the activity of all the people (past and present) who did research, published it, compiled datasets, created software, created visualizations, and spread all kinds of enthusiasm; people who enjoy sharing, explaining what they do and encouraging others to try and make stuff, and get excited about new ways of seeing and representing spatial (or non-spatial) data...

The Impending Cloud Reshuffle
Here's a theory I have about cloud vendors (AWS, Azure, GCP): 1) Cloud vendors will increasingly focus on the lowest layers in the stack: basically leasing capacity in their data centers through an API and 2) Other pure-software providers will build all the stuff on top of it. Databases, running code, you name it...let me walk you through my thinking—I think some of it is quite well illustrated through the story of Redshift...


A Message From This Week's Sponsor

High quality data labeling, consistently
Edge cases are the most common challenges that ML teams face when training their AI models, making it difficult to reach 95+% accuracy. This can become more complex once you need to scale and start working with 3rd party data labeling solutions. The evaluation metrics that we use to measure the quality of labeled data, Intersection over Union (IoU) and F1 score, have allowed us to make swift adjustments on the go and continuously improve the quality of our labeling standards. To find out more and start exploring our end-to-end data labeling service, speak to the team at Supahands today.

Data Science Articles & Videos

OpenAI Residency
As part of our effort to support and develop AI talent, we’re excited to announce the OpenAI Residency. This new program offers a pathway to a full-time role at OpenAI for researchers and engineers who don’t currently focus on artificial intelligence. We are excited to get applications from everyone, and will make a special effort to hear from underrepresented groups in technology...

Exploring the beauty of pure mathematics in novel ways
As part of DeepMind's mission to solve intelligence, we explored the potential of machine learning (ML) to recognize mathematical structures and patterns, and help guide mathematicians toward discoveries they may otherwise never have found — demonstrating for the first time that AI can help at the forefront of pure mathematics...Our research paper, published today in the journal Nature, details our collaboration with top mathematicians to apply AI toward discovering new insights in two areas of pure mathematics: topology and representation theory...

Bird-inspired dynamic grasping and perching in arboreal environments
Birds take off and land on a wide range of complex surfaces. In contrast, current robots are limited in their ability to dynamically grasp irregular objects. Leveraging recent findings on how birds take off, land, and grasp, we developed a biomimetic robot that can dynamically perch on complex surfaces and grasp irregular objects. To accommodate high-speed collisions, the robot’s two legs passively transform impact energy into grasp force, while the underactuated grasping mechanism wraps around irregularly shaped objects in less than 50 milliseconds...

Predicting long-term user engagement from short-term behavior
A problem that a company may want to address is how to derive insights from data on already engaged users to identify any common behavior patterns that can be leveraged to promote the same level of engagement in new users...In discussing the problem, they identified two segments of their long-term engaged user base that they wanted to understand: a) “Day One” users: consistent, regular users from day one, and b) “Late Bloomer” users: sporadic early users whose engagement increases at a later date...The behaviors of these two segments can be seen in the following figures...

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task...In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies...

What Data Science candidates can and cannot control in their job hunt
Having been involved in quite a few rounds of hiring data scientists in a biomedical research context, I'd like to share some perspectives that may help candidates who desire a move into a data science role in biomedical research. I'll start off with the usual disclaimer that these are personal observations and thoughts; they may not apply uniformly to all biomedical data science teams, and may reflect personal biases. With that disclaimer out of the way, here are my observations...

Kinematic self-replication in reconfigurable organisms
Here we show that clusters of cells, if freed from a developing organism, can similarly find and combine loose cells into clusters that look and move like they do, and that this ability does not have to be specifically evolved or introduced by genetic manipulation. Finally, we show that artificial intelligence can design clusters that replicate better, and perform useful work as they do so. This suggests that future technologies may, with little outside guidance, become more useful as they spread, and that life harbors surprising behaviors just below the surface, waiting to be uncovered...

Path integral control theory
Control theory is a theory from engineering that gives a formal description of how a system, such as a robot or animal, can move from a current state to a future state at minimal cost, where cost can mean time spent, or energy spent or any other quantity. Control theory is used traditionally to control industrial plants, airplanes or missiles, but is also the natural framework to model intelligent behavior in animals or robots. The mathematical formulation of deterministic control theory is very similar to classical mechanics. In fact, classical mechanics can be viewed as a special case of control theory...

Gaussian Process: First Step Towards Active Learning in Physics
Despite the extreme disparity in terms of objects and study methods, some tasks are common across multiple scientific fields. One such task is interpolation...This can be approached using multiple methods including splines, kernel density approximations, neural network fits, and many others. However, when doing so, the second natural question is the uncertainty of these interpolated values, or to what extent they can be trusted...Finally, the third and perhaps most interesting question is whether we can use the knowledge of the interpolated function and its uncertainty to guide our search strategy...All these problems can be addressed in a principled manner using Gaussian Process (GP) and GP-based Bayesian Optimization...
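A minimal sketch of the three questions raised above (interpolation, uncertainty, and where to sample next), assuming a zero-mean GP with a squared-exponential kernel on scalar inputs. This is not code from the linked article: the function names, the toy sine data, and the candidate points are illustrative assumptions. The selection rule is plain maximum-variance (uncertainty) sampling, not a full Bayesian Optimization acquisition function.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_posterior(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at a single point."""
    n = len(x_train)
    # Kernel matrix with a small noise term on the diagonal for stability.
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in x_train]
    alpha = solve(K, y_train)   # K^{-1} y
    v = solve(K, k_star)        # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)


# Toy data (assumed for illustration): samples of sin(x).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(x) for x in x_train]

# Interpolate at candidate points and pick the most uncertain one next.
candidates = [0.5, 1.5, 2.5, 6.0]
posterior = {x: gp_posterior(x_train, y_train, x) for x in candidates}
next_x = max(candidates, key=lambda x: posterior[x][1])
print(next_x)
```

The candidate farthest from the training data has the largest posterior variance, so it is the one selected for the next measurement.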

Procedural storytelling is exploding the possibilities of video game narratives
Procedural stories in video games often induce a specific kind of delight. You’ll know when it hits — a realization that the code and algorithms of the game seem to be generating a coherent narrative from your own impulsive, seemingly chaotic actions...Drama, as video games continue to prove, is harder to convince players of than space itself, which makes procedural successes all the more eye-catching — from mainstream hits such as The Sims to cult classics like Rimworld. Now it feels like this sandbox approach to storytelling is starting to bear even greater narrative fruit...


Tools

Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data, whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components, like tables and charts, to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result: less time on repetitive work and more time to discover insights.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

R&D Data Scientist - Danaher - Port Washington, NY
As a Data Scientist at IBM, you will help transform our clients’ data into tangible business value by analyzing information, communicating outcomes and collaborating on product development. Work with Best in Class open source and visual tools, along with the most flexible and scalable deployment options. Whether it’s investigating patient trends or weather patterns, you will work to solve real world problems for the industries transforming how we live.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Pytorch Conv2d Weights Explained
Understanding weight dimensions, visualization, number of parameters and the infamous size mismatch...One of the most common problems I have found in my journey with PyTorch is the size mismatch error when loading weights into my models. As you know, PyTorch does not save the computational graph of your model when you save the model weights (unlike TensorFlow). So when you train multiple models with different configurations (different depths, widths, resolutions…) it is very common to mistype the weights file name and load the wrong weights for your target model...This mistake translates into the infamous PyTorch error for the Conv2d weights: the size mismatch...
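A small sketch of where the size mismatch comes from, assuming the standard PyTorch Conv2d weight layout of (out_channels, in_channels / groups, kernel height, kernel width). It uses plain Python rather than PyTorch itself, and the helper names are hypothetical, not part of any library or of the linked article.

```python
def conv2d_weight_shape(in_channels, out_channels, kernel_size, groups=1):
    """Shape of a Conv2d weight tensor in PyTorch's layout:
    (out_channels, in_channels // groups, kernel_height, kernel_width)."""
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size, kernel_size)
    if in_channels % groups or out_channels % groups:
        raise ValueError("channels must be divisible by groups")
    return (out_channels, in_channels // groups, *kernel_size)


def num_parameters(shape, bias=True):
    """Number of weights, plus one bias term per output channel if enabled."""
    count = 1
    for dim in shape:
        count *= dim
    return count + (shape[0] if bias else 0)


def check_compatible(model_shape, checkpoint_shape):
    """Return a message describing a mismatch, or None if the shapes agree."""
    if model_shape != checkpoint_shape:
        return (f"size mismatch: checkpoint {checkpoint_shape} "
                f"vs model {model_shape}")
    return None


# Two configurations that differ only in width (number of output channels).
narrow = conv2d_weight_shape(3, 64, 3)
wide = conv2d_weight_shape(3, 128, 3)

print(narrow, num_parameters(narrow))   # 64*3*3*3 weights + 64 biases = 1792
print(check_compatible(wide, narrow))   # weights saved from the narrow model
```

Loading the narrow model's weights into the wide model fails because the first dimension of the weight tensor (the number of output channels) differs, even though the layer type and kernel size are the same.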

Random Forests Algorithm explained with a real-life example and some Python code
Random Forests is a Machine Learning algorithm that tackles one of the biggest problems with Decision Trees: variance...To address overfitting, and reduce the variance in Decision Trees, Leo Breiman developed the Random Forests algorithm. This was an innovative algorithm because it utilized, for the first time, the statistical technique of Bootstrapping and combined the results of training multiple models into a single, more powerful learning model...But before you see Random Forests in action, and code, let’s take a detour to explore what makes Random Forests unique...
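The idea of training several models on bootstrapped samples and combining their results can be sketched as follows. This is a deliberately simplified illustration, not the linked article's code and not a full Random Forests implementation: each "tree" is a one-split decision stump, a random feature per stump stands in for per-split feature subsampling, and the toy data and function names are assumptions.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that minimizes training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if row[feature] <= threshold else right) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left, right)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left, right = stump
    return left if row[feature] <= threshold else right


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Each stump sees one randomly chosen feature.
        feature = rng.randrange(n_features)
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data (assumed for illustration): two well-separated classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (10, 10)))
```

Any single stump trained on an unlucky bootstrap sample can be wrong, but the majority vote across many stumps is far more stable, which is the variance reduction the passage describes.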

pybaobabdt - Python implementation of visualization technique for (sklearn) decision trees
The pybaobabdt package provides a Python implementation for the visualization of decision trees. The technique is based on the scientific paper BaobabView: Interactive construction and analysis of decision trees developed by the TU/e. A typical decision tree is visualized using a standard node link diagram...The problem, however, is that information is not easily extracted from this. For example, which classes are easy to separate, which classes are similar, and where does the main flow of items go? Therefore, we developed techniques to answer these questions with a scalable visualization...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

All the best,
Hannah & Sebastian