Data Science Weekly Newsletter

Issue 356

September 17, 2020

Editor's Picks

Good Experiment, Bad Experiment
Over the past 10 years as a product leader I’ve shipped hundreds of A/B tests and product experiments to every kind of customer, from social gamers to the most discerning enterprise software buyers. I’ve learned many lessons about building a disciplined and high-impact culture of experimentation...So, in the spirit of Ben Horowitz’s classic post, Good Product Manager / Bad Product Manager, here are my lessons learned about Good Experiments and Bad Experiments....

An AI Epidemiologist Sent the First Warnings of the Wuhan Virus
On January 9, the World Health Organization notified the public of a flu-like outbreak in China: a cluster of pneumonia cases had been reported in Wuhan, possibly from vendors’ exposure to live animals at the Huanan Seafood Market. The US Centers for Disease Control and Prevention had gotten the word out a few days earlier, on January 6. But a Canadian health monitoring platform (BlueDot) had beaten them both to the punch, sending word of the outbreak to its customers on December 31...

The AI delusion: why humans trump machines
As Gary Marcus and Ernest Davis explain in Rebooting AI, the reason we might want to make AI more human-like is not to simulate a person but to improve the performance of the machine. Trained as a cognitive scientist, Marcus is one of the most vocal and perceptive critics of AI hype, while Davis is a prominent computer scientist; the duo are perfectly positioned to inject some realism into this hyperbole-prone field...

A Message From This Week's Sponsor

Help meet the growing demand in data science.

The Data Science Career Track is a 6-month, self-paced online course that will pair you with your own industry expert mentor as you learn skills like data wrangling and data storytelling, and build your unique portfolio to stand out in the job market.
Land your dream job as a data scientist within six months of graduating or the course is free.

Data Science Articles & Videos

Someone used neural networks to upscale a famous 1896 video to 4k quality
Arrival of a Train at La Ciotat is one of the most famous films in cinema history. Shot by French filmmakers Auguste and Louis Lumière, it achieved an unprecedented level of quality for its time. Some people regard its commercial exhibition in 1896 as the birth of the film industry...Today, the Lumière brothers' masterpiece looks grainy, murky, and basically ancient. But a man named Denis Shiryaev used modern machine-learning techniques to upscale the classic film to 21st-century video standards...

Towards a Conversational Agent that Can Chat About...Anything
In “Towards a Human-like Open-Domain Chatbot”, we present Meena, a 2.6 billion parameter end-to-end trained neural conversational model. We show that Meena can conduct conversations that are more sensible and specific than existing state-of-the-art chatbots. Such improvements are reflected through a new human evaluation metric that we propose for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which captures basic but important attributes of human conversation. Remarkably, we demonstrate that perplexity, an automatic metric that is readily available to any neural conversational model, highly correlates with SSA...
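
As a rough illustration of the SSA metric mentioned above, the sketch below averages the share of responses judged sensible and the share judged specific. The function name `ssa` and the label format are hypothetical, and counting a response as specific only when it is also sensible is an assumption of this sketch rather than something stated in the blurb.

```python
def ssa(labels):
    """Sensibleness and Specificity Average over human-labeled responses.

    labels: list of (sensible, specific) boolean pairs, one per response.
    This input format is hypothetical and chosen only for illustration.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    # Fraction of responses that raters marked as sensible.
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    # Assumption: a response only counts as specific if it is also sensible.
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    return (sensibleness + specificity) / 2


# Four made-up ratings: (sensible, specific)
ratings = [(True, True), (True, False), (False, False), (True, True)]
score = ssa(ratings)
```

With these made-up ratings, three of four responses are sensible and two of four are specific, so the score is the mean of 0.75 and 0.5.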

Understanding GauGAN Part 1: Unraveling Nvidia's Landscape Painting GANs
One of the most interesting papers presented at CVPR in 2019 was Nvidia's Semantic Image Synthesis with Spatially-Adaptive Normalization. This features their new algorithm, GauGAN, which can effectively turn doodles into reality...In this article we'll see how the GauGAN algorithm works on a granular level. We'll also gain insight into why Nvidia is investing so heavily in the use of these algorithms...

Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle...In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field...

HiPlot: High-dimensional interactive plots made easy
HiPlot is a lightweight interactive visualization tool to help AI researchers discover correlations and patterns in high-dimensional data. It uses parallel plots and other graphical ways to represent information more clearly, and it can be run quickly from a Jupyter notebook with no setup required. HiPlot enables machine learning (ML) researchers to more easily evaluate the influence of their hyperparameters, such as learning rate, regularizations, and architecture. It can also be used by researchers in other fields, so they can observe and analyze correlations in data relevant to their work...

SciPy 1.0: fundamental algorithms for scientific computing in Python
SciPy is an open-source scientific computing library for the Python programming language. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year. In this work, we provide an overview of the capabilities and development practices of SciPy 1.0 and highlight some recent technical developments...

AI Neural Networks being used to generate HQ textures for older video games
Long story short, Enhanced Super Resolution Generative Adversarial Network, or ESRGAN, is an upscaling method that is capable of generating realistic textures during single image super-resolution. Basically it's a machine learning technique that uses a generative adversarial network to up-res smaller images. By doing it over several passes, it will usually produce an image with more fidelity than methods such as SRCNN and SRGAN. In fact, ESRGAN is based on SRGAN. The difference between the two is that ESRGAN improves on SRGAN's network architecture, adversarial loss and perceptual loss...ESRGAN has been used to improve the textures of older games such as Doom and Morrowind. In fact, there's a DOOM texture pack that was released recently using this method...

Learning Discrete Distributions by Dequantization
Media is generally stored digitally and is therefore discrete. Many successful deep distribution models in deep learning learn a density, i.e., the distribution of a continuous random variable. Naïve optimization on discrete data leads to arbitrarily high likelihoods, and instead, it has become standard practice to add noise to datapoints. In this paper, we present a general framework for dequantization that captures existing methods as a special case. We derive two new dequantization objectives: importance-weighted (iw) dequantization and Rényi dequantization. In addition, we introduce autoregressive dequantization (ARD) for more flexible dequantization distributions...
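
The "add noise to datapoints" practice mentioned above is commonly done by adding uniform noise to each integer value before fitting a density. A minimal sketch, assuming 8-bit pixel intensities; the function name and parameters are illustrative, and this is the plain uniform baseline, not the paper's importance-weighted, Rényi, or autoregressive variants.

```python
import random


def uniform_dequantize(pixels, levels=256, rng=random):
    """Turn integer values in {0, ..., levels - 1} into continuous values.

    Each integer p is spread uniformly over the bin [p, p + 1) and then
    rescaled by the number of levels, so results fall (up to floating-point
    rounding) in [0, 1). Names and defaults here are hypothetical.
    """
    return [(p + rng.random()) / levels for p in pixels]


# Seeded generator so the illustration is repeatable.
pixels = [0, 17, 255]
continuous = uniform_dequantize(pixels, rng=random.Random(0))
```

Flooring `x * levels` for each result recovers the original integer, which is what makes the added noise harmless for the discrete data.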

Building domain specific natural language applications
In this episode of the Data Exchange I speak with David Talby, co-creator of Spark NLP, an open source, highly scalable, production grade natural language processing (NLP) library. Spark NLP has become one of the more popular NLP libraries and is available on PyPI, Conda, Maven, and Spark Packages. With recent advances in research in large-scale natural language models, there is strong interest in domain specific natural language applications. Besides their work on Spark NLP, David and his collaborators are building natural language models tuned specifically for healthcare applications...

Training

Data Science:
Intensive training for a career in applied statistics and machine learning

Our curriculum is designed to get you hired. Classes are interactive and have a rigorous structure. You'll also apply your knowledge of research, data pipelines, and APIs to build a real-world project with a small team of students from other tracks.
Cost: $0 upfront + 17% of salary for two years. $30k USD maximum total payment.
Check us out here -> Lambda School Data Science

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Director of Data Science - Komodo Health - NYC

Komodo Health is addressing the global burden of disease through the development of the world’s most actionable map of healthcare data. As a fast-growing startup that has already partnered with multiple Fortune 500 companies, we have very ambitious goals that have been designed with career development in mind.
We are looking for a Director of Data Science to play a critical role in the success of our growing Data Science team. You will lead a group of data scientists and data engineers within the Data Science team that is involved in all aspects of building data products...

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications
Generative adversarial networks (GANs) have recently become a hot research topic. GANs have been widely studied since 2014, and a large number of algorithms have been proposed. However, there are few comprehensive studies explaining the connections among different GAN variants and how they have evolved. In this paper, we attempt to provide a review of various GAN methods from the perspectives of algorithms, theory, and applications. Firstly, the motivations, mathematical representations, and structure of most GAN algorithms are introduced in detail. Furthermore, GANs have been combined with other machine learning algorithms for specific applications, such as semi-supervised learning, transfer learning, and reinforcement learning. This paper compares the commonalities and differences of these GAN methods. Secondly, theoretical issues related to GANs are investigated. Thirdly, typical applications of GANs in image processing and computer vision, natural language processing, music, speech and audio, the medical field, and data science are illustrated. Finally, the future open research problems for GANs are pointed out...

Fun with Hidden Markov Models – An Introduction to HMM
This notebook introduces the Hidden Markov Model (HMM), a simple model for sequential data...We will see: a) what an HMM is and when you might want to use it, b) the so-called "three problems" of an HMM, and c) how to implement an HMM in PyTorch...
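
One of the three classic HMM problems is scoring: computing the likelihood of an observation sequence under a given model. A minimal pure-Python sketch of the forward algorithm for that problem follows. It is not the notebook's PyTorch implementation, and the parameter names and toy numbers are made up for illustration.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations) for a discrete HMM using the forward algorithm.

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][o]   : probability of emitting symbol o in state i
    """
    if not observations:
        raise ValueError("need at least one observation")
    n_states = len(initial)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        # Sum over every previous state, then weight by the emission probability.
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states)) * emission[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Toy two-state, two-symbol model (all numbers hypothetical).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.5, 0.5], [0.1, 0.9]]
likelihood = forward_likelihood(initial, transition, emission, [0, 1])
```

A useful sanity check on any implementation like this is that the likelihoods of all possible sequences of a fixed length sum to 1.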

A Complete Pandas Glossary for Data Science
Like most others, I tried to learn Pandas through boot camps. Unfortunately, the problem with boot camps is that you forget everything pretty quickly if you don’t practice what you learn!...At the same time, I found that there was a need for a central Pandas resource that I could refer to when working on personal data science projects. That’s how this came to fruition. Use this as a resource to learn Pandas and also to refer to!...

Books

Data Science in Production: Building Scalable Model Pipelines with Python
This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and scikit-learn, and will focus on scaling up prototype models to production....
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.
