Data Science Weekly Newsletter - Issue 428

June 17, 2021

Editor Picks
  • Learning Bayesian Statistics Podcast Episode #41:
    Thinking Bayes, with Allen Downey

    Let’s think Bayes, shall we? And who better to do that than the author of the well-known book, Think Bayes — Allen Downey himself!...In this special episode, Allen and I talked about his background, how he came to the stats and teaching worlds, and why he wanted to write this book in the first place...We also talked about some types of models, their usefulness and their weaknesses, but I’ll let you discover that...
  • Predicting Consumer Contracts
    This Article empirically examines whether a computational language model can read and understand consumer contracts. Language models are able to perform a wide range of complex tasks by predicting the next word in a sequence. In the legal domain, language models can summarize laws, draft case documents, and translate legalese into plain English. However, the ability of language models to inform consumers of their contractual rights and obligations has not been explored in detail...

A Message from this week's Sponsor:


Kickstart Your New Career with a Data Science & Analytics Bootcamp

Don’t miss your chance to join a Data Scientist-led, online Metis bootcamp plus get career support until you’re hired. Bootcamps are starting soon! Ready to take your data science or analytics career to the next level? Learn more about the Metis Online Data Science & Analytics Bootcamps.



Data Science Articles & Videos

  • FIDENZA - Generative Algorithm for Art, Explained
    Fidenza is my most versatile generative algorithm to date. Although it is not overly complex, the core structures of the algorithm are highly flexible, allowing for enough variety to produce continuously surprising results...This is why I’m so excited that Fidenza is being showcased on Art Blocks, the only site in existence that perfectly suits generative art and raises the bar for developing these kinds of high-quality generative art algorithms...Let’s dive into how Fidenza works, examine some of the unique features, and have a look at the variety of output...
  • Dask vs Vaex - a qualitative comparison
    There are several popular technologies in the Python ecosystem that are frequently used for processing large datasets in the context of data science and data engineering...Dask is an open-source, general framework for parallel and distributed computations in Python. It is often the go-to technology for horizontal scaling of various types of computations and data science tasks...Vaex is a high-performance DataFrame library in Python, primarily built for the processing, exploration, and analysis of datasets as large as your hard drive, on a single machine...So one might wonder: "What is the real difference between these two libraries?" or "When or why would I want to use Vaex?"...In what follows I will highlight the main differences between these two technologies, hoping to improve your understanding of them and enable you to make a more informed choice when choosing the right tool for your use case...
  • Extreme Classification with Similarity Search
    This demo aims to label new texts automatically when the number of possible labels is enormous. This scenario is known as extreme classification, a supervised learning variant that deals with multi-class and multi-label problems involving many choices...Examples for applying extreme classification are labeling a new article with Wikipedia’s topical labels, matching web content with a set of relevant advertisements, classifying product descriptions with catalog labels, and classifying a resume into a collection of pertinent job titles...
  • Analysis of “What’s 2/3 of the Average”
    The 2/3 of the average problem is a well-known puzzle in game theory, and it illustrates some fundamental game-theoretic concepts. To recap, here’s the problem statement:...Suppose everyone in your town selects a real number between 0 and 100, inclusive (i.e. 0 and 100 are both possible choices, as is any other number between). The winner is the individual (or individuals) who selects the number closest to 2/3 of the average of numbers chosen. What number do you choose? Why?...
  • A Gentle Introduction to Multi-Objective Optimisation Talk by Eyal Kazin [Video]
    Multi-Objective Optimisation, also known as Pareto Optimisation, is a method to optimise for multiple parameters simultaneously. When applicable, this method provides better results than the common practice of combining multiple parameters into a single parameter heuristic...The single heuristic approach is like horse blinders limiting the view of the solution space, whereas Pareto Optimisation enables a bird’s eye view...Real-world applications range from supply chain management, manufacturing, and aircraft design to land use planning...This hands-on tutorial is geared towards anyone interested in improving their optimisation skills (e.g., analysts, scientists, engineers, economists)...
  • Learning an Accurate Physics Simulator via Adversarial Reinforcement Learning
    Simulation empowers various engineering disciplines to quickly prototype with minimal human effort...However, as the hand-derived physics in simulations does not match the real world exactly, control policies trained entirely within simulation can fail when tested on real hardware — a challenge known as the sim-to-real gap or the domain adaptation problem...In our ICRA 2021 publication “SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning”, we propose to treat the physics simulator as a learnable component that is trained by Deep Reinforcement Learning with a special reward function that penalizes discrepancies between the trajectories (i.e., the movement of the robots over time) generated in simulation and a small number of trajectories that are collected on real robots...
  • Numerical investigation of minimum drag profiles in laminar flow using deep learning surrogates
    Efficiently predicting the flowfield and load in aerodynamic shape optimisation remains a highly challenging and relevant task. Deep learning methods have been of particular interest for such problems, due to their success for solving inverse problems in other fields. In the present study, U-net based deep neural network (DNN) models are trained with high-fidelity datasets to infer flow fields, and then employed as surrogate models to carry out the shape optimisation problem, i.e. to find a drag minimal profile with a fixed cross-section area subjected to a two-dimensional steady laminar flow...
  • Thinking Like Transformers
    What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder -- attention and feed-forward computation -- into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP)...
  • Are Sophisticated Machine Learning Designs Your Go-To Solution? Here’s Why They Shouldn’t Be
    As a data scientist, my daily work revolves around our ATO (Account Takeover) product and improvement research. Briefly, an ATO is when a bad actor gains access to another party’s legitimate account. The product decides whether a login is an ATO by using internal data supplemented by additional external data during login...In this post, I’ll share with you how we approached this problem for ATO and why the newest, state-of-the-art, or most sophisticated solution isn’t always the best option. You can provide simpler, faster, but not inferior results for your multiple-component problems without rushing into sophisticated designs. Feature engineering and domain knowledge can go a long way...
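To make the Fidenza piece above a little more concrete, here is a minimal flow-field sketch in Python (entirely my own toy, not the artist's actual algorithm): a grid of angles defines a vector field, and each curve is traced by repeatedly stepping a point in the local direction.

```python
import math

COLS, ROWS, STEP = 20, 20, 1.0

def field_angle(col, row):
    # a smooth angle field; real flow-field pieces often use noise here
    return math.sin(col * 0.3) + math.cos(row * 0.3)

def trace(x, y, steps=50):
    """Follow the field from (x, y), returning the traced path."""
    path = [(x, y)]
    for _ in range(steps):
        # look up the angle of the grid cell the point currently sits in
        col = min(max(int(x), 0), COLS - 1)
        row = min(max(int(y), 0), ROWS - 1)
        angle = field_angle(col, row)
        x += STEP * math.cos(angle)
        y += STEP * math.sin(angle)
        path.append((x, y))
    return path

curve = trace(2.0, 2.0)
print(len(curve), curve[-1])
```

A real piece would layer collision checks, varied stroke shapes, and colour on top of this basic tracing loop.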
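The extreme-classification demo above pairs naturally with a nearest-neighbour sketch. This toy (my own names and data, not the demo's code) represents labelled documents as vectors, ranks them by cosine similarity to a query, and assigns the union of the top neighbours' labels:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# (embedding, labels) pairs standing in for a large labelled corpus
index = [
    ([1.0, 0.1, 0.0], {"python", "pandas"}),
    ([0.9, 0.2, 0.1], {"python", "numpy"}),
    ([0.0, 0.1, 1.0], {"cooking"}),
]

def predict_labels(query_vec, k=2):
    # rank the corpus by similarity to the query and pool the
    # labels of the k nearest neighbours
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    labels = set()
    for _, labs in ranked[:k]:
        labels |= labs
    return labels

print(predict_labels([0.95, 0.15, 0.05]))
```

At extreme scale the linear scan would be replaced by an approximate nearest-neighbour index, which is the point of pairing similarity search with this problem.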
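The iterated reasoning behind the "2/3 of the average" puzzle can be simulated directly: if every player best-responds to the previous round's average, all guesses collapse toward the Nash equilibrium at 0. A quick sketch (setup mine, not from the article):

```python
import random

random.seed(0)
guesses = [random.uniform(0, 100) for _ in range(1000)]

for _ in range(50):
    avg = sum(guesses) / len(guesses)
    # every player best-responds with 2/3 of the current average
    guesses = [2 / 3 * avg] * len(guesses)

print(guesses[0])  # shrinks by a factor of 2/3 each round, toward 0
```

Each round multiplies the common guess by 2/3, which is why "think one level deeper" reasoning, taken to its limit, lands everyone at zero.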
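For the multi-objective optimisation talk, the central object is the Pareto front: the solutions not dominated by any other. A minimal sketch (my own toy data, with both objectives to be minimised, e.g. cost and delivery time):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and
    strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    # keep only the points no other point dominates
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(1, 9), (2, 7), (3, 8), (4, 3), (6, 2), (7, 5)]
print(pareto_front(candidates))
```

Collapsing the two objectives into one weighted score would pick a single point and hide the rest of the front, which is the "horse blinders" effect the talk contrasts with the bird's eye view.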
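The RASP paper's two core primitives can be mimicked in plain Python (a rough illustration of the idea, not the paper's implementation): select builds a boolean attention matrix from keys and queries, and aggregate averages the values each row selects. Here they compute a running mean over a sequence:

```python
def select(keys, queries, predicate):
    # one row per query; True marks the keys it attends to
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(matrix, values):
    # average the selected values in each row (0.0 if none selected)
    out = []
    for row in matrix:
        chosen = [v for sel, v in zip(row, values) if sel]
        out.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return out

tokens = [3.0, 1.0, 4.0, 1.0]
positions = list(range(len(tokens)))
# each position attends to itself and everything before it
attn = select(positions, positions, lambda k, q: k <= q)
print(aggregate(attn, tokens))
```

The paper's point is that compositions of such primitives describe what a transformer-encoder can compute, the way finite state machines describe RNNs.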



Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.

The course is broken down into three guides:
  1. Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

  2. Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

  3. Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
Click here to learn more ...

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Senior Data Scientist - WarnerMedia - New York, NY

    WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

    Reporting to the Sr. Manager, Data Science, this role will help to develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation.

        Want to post a job here? Email us for details >>


Training & Resources

  • Welcome to the 🤗 (Hugging Face) Course!
    This course will teach you about natural language processing (NLP) using libraries from the Hugging Face ecosystem — 🤗 Transformers, 🤗 Datasets, 🤗 Tokenizers, and 🤗 Accelerate — as well as the Hugging Face Hub. It’s completely free and without ads...
  • WaveNet Deep Dive
    WaveNet creates more natural-sounding speech for products used by millions of people around the world...Discover how it has evolved from a research concept into an advanced real-world system that helps Google remove communication barriers for its users...
  • dida
    Dida is a (WIP) library for streaming, incremental, iterative, internally-consistent computation on time-varying collections...Dida is heavily based on differential dataflow and is informed by experience using differential dataflow as a backend at Materialize...You write code that manipulates collections using familiar operations like map, join and loop. You run the code on some input and get some output. Then when the input changes, you get changes to the output, much faster than recomputing the whole thing from scratch. (And the outputs will be correct!)...
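The incremental idea behind dida can be sketched in a few lines (a toy of my own in Python; this is not dida's actual API): instead of recomputing a map over the whole collection when the input changes, you transform only the changes themselves.

```python
def incremental_map(f):
    """Lift f into a function over change batches rather than
    whole collections."""
    def apply(changes):
        # a change is (value, diff): diff +1 inserts, -1 deletes
        return [(f(value), diff) for value, diff in changes]
    return apply

double = incremental_map(lambda x: x * 2)
print(double([(3, 1), (5, 1)]))  # two inserts -> [(6, 1), (10, 1)]
print(double([(3, -1)]))         # a later deletion -> [(6, -1)]
```

Map is the easy case; the differential-dataflow machinery dida builds on exists to make stateful operators like join and loop behave the same way.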




Books

  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

    All the best,
    Hannah & Sebastian
Sign up to receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe. No spam — we keep your email safe and do not share it.