Data Science Weekly Newsletter

Issue 395

June 17, 2021

Editor's Picks

The second decade of synthetic biology: 2010–2020
Synthetic biology is among the most hyped research topics this century, and in 2010 it entered its teenage years. But rather than these being a problematic time, we’ve seen synthetic biology blossom and deliver many new technologies and landmark achievements...In 2020 synthetic biology turned 20 years old. Its first decade saw some impressive research papers, lots of visionary thinking and unprecedented excitement, but its second decade, from 2010 to 2020, was when the hype really needed to be replaced by some real achievements. So how has it done?...

Customizing Triggers with Concealed Data Poisoning
We develop a new data poisoning attack that controls model predictions whenever a desired trigger phrase appears in the input...Modern NLP has an obsession with gathering large training sets...What are the dangers of using such untrusted data?...A potential concern is data poisoning attacks, where an adversary inserts a few malicious examples into a victim's training set in order to manipulate their trained model. Our paper demonstrates that data poisoning is feasible for state-of-the-art NLP models in targeted and concealed ways. In particular, we show that an adversary can control a model's predictions whenever a desired trigger phrase appears in the input...

Build better intuitions about different machine learning and deep learning methods [Tweet Thread]
People often ask me how to build better intuitions about different machine learning and deep learning methods. This is a thread about my experience (as an NLP Researcher) building better intuitions of ML/deep learning methods, including resources and tips...

A Message From This Week's Sponsor

Find your next job through Vettery

Vettery is an online hiring marketplace that's home to thousands of actively hiring startups and Fortune 500 companies. Create a free profile, name your salary, and connect with hiring managers looking to grow their teams.
Get started - it’s completely free for job-seekers!

Data Science Articles & Videos

Algorithms for Causal Reasoning in Probability Trees
Probability trees are one of the simplest models of causal generative processes. They possess clean semantics and -- unlike causal Bayesian networks -- they can represent context-specific causal dependencies, which are necessary for e.g. causal induction. Yet, they have received little attention from the AI and ML community. Here we present concrete algorithms for causal reasoning in discrete probability trees that cover the entire causal hierarchy (association, intervention, and counterfactuals), and operate on arbitrary propositional and causal events. Our work expands the domain of causal reasoning to a very general class of discrete stochastic processes...

How I got 10k post karma on reddit with (and without) fast.ai
Back in 2006-2007 my friend and I put together a spreadsheet of 20 or so high-level achievements called “Everything’s a Contest”. This included goals like “Photograph a live grizzly bear in the wild”, “Have something named after you”, and “Get 10,000 (post) karma on Reddit”...In early 2020 I decided to tackle one of these long-standing contests. But I was going to do it with AI since I wanted to see how I could apply AI to more of my problems. I’m a huge fan of fast.ai and I appreciate its high-level abstractions and simple interfaces. For someone trying to get into deep learning, I would highly recommend it and the associated courses. This is a post about how I built a bot to gain karma on Reddit with fast.ai...

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores
Netflix has more than 195 million subscribers that generate petabytes of data every day...Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video...We also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services need to do a fast lookup for this fine-grained data, which is generated periodically...The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store...Introducing Bulldozer: a self-serve data platform that moves data efficiently from data warehouse tables to key-value stores in batches...

Comparing Data Version Control Tools
Whether you’re using logistic regression or a neural network, all models require data in order to be trained, tested, and deployed. Managing and creating the data sets used for these models requires lots of time and space, and can quickly become muddled due to multiple users altering and updating the data...This can lead to unexpected outcomes as data scientists continue to release new versions of the models but test against different data sets. Many data scientists could be training and developing models on the same few sets of training data. This could lead to many subtle changes being made to the data set, which can lead to unexpected outcomes once the models are deployed...This blog post discusses the many challenges that come with managing data, and provides an overview of the top tools for machine learning and data version control...

Navigating the landscape of multiplayer games
Multiplayer games have long been used as testbeds in artificial intelligence research...Traditionally, researchers have focused on using well-known games to build strong agents. This progress, however, can be better informed by characterizing games and their topological landscape. Tackling this latter question can facilitate understanding of agents and help determine what game an agent should target next as part of its training. Here, we show how network measures applied to response graphs of large-scale games enable the creation of a landscape of games, quantifying relationships between games of varying sizes and characteristics....

Create your own smart baby monitor with a RaspberryPi and Tensorflow
Commercial baby monitors are dumber than the ideal device I’d want. They don’t detect your baby’s cries; they simply act like intercoms that take sound from a source to a speaker...So I’ve come up with a specification for a smart baby monitor...It should run on anything as simple and cheap as a RaspberryPi with a cheap USB microphone...It should detect my baby’s cries and notify me (ideally on my phone) when he starts/stops crying, or track the data points on my dashboard, or do any kind of tasks that I’d want to run when my son is crying. It shouldn’t only act as a dumb intercom that delivers sound from a source to one single type of compatible device...Let’s see how to use our favourite open-source tools to get this job done...

Weird A.I. Yankovic
Weird A.I. Yankovic is a neural network based lyric generation system. Given a syllable and rhyme scheme, it attempts to generate new lyrics that fit that scheme...The intended use is to generate new lyrics for existing songs by feeding in the syllable and rhyme scheme for the song and then some contextualization information...It does not sing or match the lyrics to the music. You have to do that yourself. To make that easier, there are routines at the end for creating a karaoke video...

Launch HN: Deepnote (YC S19) – A better data science notebook
Two years ago, my co-founders and I started to think about a better data science notebook. Deepnote is built on top of the Jupyter ecosystem. We are using the same format, and we intend to remain fully compatible in both directions. But to solve the above problems, we've introduced significant changes...First, we made collaboration a first-class citizen...Second, we completely redesigned the interface to encourage best practices, write clean code, define dependencies, and create reproducible notebooks...Third, we made Deepnote easy to integrate with other services...

Detecting Fake News
In this episode of the Data Exchange Podcast, I speak with Xinyi Zhou, a graduate student in Computer and Information Science at Syracuse University. Xinyi and her advisor (Reza Zafarani) recently wrote a comprehensive survey paper entitled “A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities”. They set out to organize the many different methods and perspectives used to detect fake news. Their paper is a great resource for anyone wanting to understand the strengths and limitations of various state-of-the-art techniques, and to get a feel for where the research community might be headed in the near future...

Training

Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:

Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more...
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Product Analyst, Data Scientist - Google - New York, NY

Product Analysts provide quantitative analysis, market expertise and a strategic perspective to our partners throughout the organization. As a data-loving member of the team, you'll serve as an analytics expert for your partners, using numbers to help them make better decisions. You will weave stories with meaningful insight from data. You'll make key recommendations for your fellow Googlers in Engineering and Product Management.
As a Product Analyst, you relish tallying up the numbers one minute and communicating your findings to a team leader the next. You can see different angles of a product or business opportunity, and you know how to connect the dots and interact with people in various roles and functions. You will work to effectively turn business questions into data analysis, and provide meaningful recommendations on strategy....

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

An introduction to Pluto
Pluto is a new computational notebook for the Julia programming language. Computational notebooks are a way to program inside of a web browser, storing code, annotations, and output, including graphics, in a single place...Pluto appears, at first glance, quite like Jupyter and similar notebooks, but the way it works is rather different...There is no hidden state, and nothing depends on the order in which cells were run... Pluto analyzes the code in all of the cells and constructs a dependency graph, so that it knows the order in which the cells must be executed; this is based on which cells use which variables. Cells can be grabbed with the mouse and arranged in any order and this has no effect on the results. When the code in a cell is changed, the cell is run; all of the cells that depend on it, and only those cells, are also run, in dependency order. Therefore one is not allowed to define a global variable in more than one cell; an attempt to do so results in an error message...
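
The dependency-ordering idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique (find which names each cell defines and uses, build a dependency graph, run cells in topological order), not Pluto's actual implementation, which targets Julia. The function names `cell_names`, `dependency_graph`, `execution_order` and `cells_to_rerun`, and the sample `CELLS`, are hypothetical; the sketch assumes Python 3.9+ for `graphlib`.

```python
import ast
from graphlib import TopologicalSorter


def cell_names(source):
    """Return (defined, used) variable names for one cell's source code."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    # Names a cell defines itself are not external dependencies.
    return defined, used - defined


def dependency_graph(cells):
    """Map each cell id to the set of cell ids it depends on."""
    owner = {}  # variable name -> id of the cell that defines it
    uses = {}   # cell id -> names the cell reads from elsewhere
    for cell_id, source in cells.items():
        defined, used = cell_names(source)
        uses[cell_id] = used
        for name in defined:
            if name in owner:
                # Mirrors the rule that a global may be defined in only one cell.
                raise ValueError(f"{name!r} is defined in more than one cell")
            owner[name] = cell_id
    return {
        cell_id: {owner[name] for name in used if name in owner}
        for cell_id, used in uses.items()
    }


def execution_order(cells):
    """Return cell ids so that every cell comes after the cells it depends on."""
    return list(TopologicalSorter(dependency_graph(cells)).static_order())


def cells_to_rerun(cells, changed):
    """Return the changed cell plus everything downstream of it, in run order."""
    graph = dependency_graph(cells)
    dirty = {changed}
    for cell_id in execution_order(cells):
        if graph[cell_id] & dirty:
            dirty.add(cell_id)
    return [cell_id for cell_id in execution_order(cells) if cell_id in dirty]


# Hypothetical notebook: cells are deliberately listed out of dependency order.
CELLS = {
    "c1": "total = a + b",
    "c2": "a = 1",
    "c3": "b = a * 2",
    "c4": "unrelated = 10",
}
```

With these sample cells, `cells_to_rerun(CELLS, "c3")` returns `['c3', 'c1']`: editing the cell that defines `b` reruns only that cell and the one that reads `b`, regardless of where the cells sit on the page.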

Recent Developments in Graph Network Architectures [PDF]
Lecture slides from a class titled "Deep Learning for Data Science" at the Nanyang Technological University (NTU), Singapore...Lecture by Xavier Bresson...covers a review of some exciting works on GNNs published in 2019-2020...

My [Mat Kelcey] updated list of cool machine learning books
awhile ago i posted my list of cool machine learning books, but it's been awhile so it's probably time to update it...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
