Data Science Weekly Newsletter - Issue 423

Issue #390

May 13 2021

Editor Picks
  • There’s no such thing as a tree (phylogenetically)
    “Trees” are not a coherent phylogenetic category. On the evolutionary tree of plants, trees are regularly interspersed with things that are absolutely, 100% not trees. This means that, for instance, either: a) The common ancestor of a maple and a mulberry tree was not a tree, b) The common ancestor of a stinging nettle and a strawberry plant was a tree...And this is true for most trees or non-trees that you can think of...I thought I had a pretty good guess at this, but the situation is far worse than I could have imagined...
  • How image search works at Dropbox
    When you're looking for that photo from a picnic a few years ago, you surely don't remember that the filename set by your camera was 2017-07-04 12.37.54.jpg...Instead, you look at individual photos, or thumbnails of them, and try to identify objects or aspects that match what you’re searching for—whether that’s to recover a photo you’ve stored, or perhaps discover the perfect shot for a new campaign in your company’s archives. Wouldn’t it be great if Dropbox could pore through all those images for you instead, and call out those which best match a few descriptive words that you dictated? That’s pretty much what our image search does...

A Message from this week's Sponsor:


Online Data Science Programs from Drexel University

Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career.

Learn more.


Data Science Articles & Videos

  • Working with JSON in Redshift
    When working with data warehouses, it’s common to have structured data stored within a table as a JSON blob. Until recently, extracting data from JSON in Redshift was extremely cumbersome. This tutorial shows you a new, easier way of working with JSON in Redshift...
  • Parsing Petabytes, SpaceML Taps Satellite Images to Help Model Wildfire Risks
    Over the past few months, my team of citizen scientists has been working towards building a Reverse Image Search Engine for the Earth. This problem was previously considered unsolvable but we came up with Curator - a pipeline that allows users to rapidly curate datasets within hours to minutes automatically. This process was originally manual and took weeks to months to accomplish. During a recent demonstration of the GIBS/Worldview imagery pipeline, a machine searched for islands through five million tiles of Earth imagery starting with a single seed image of an island. Approximately 1,000 islands were identified in just 52 minutes. If done manually, this effort would take an estimated 7,000 hours (assuming five seconds to evaluate and label each image tile) at a cost of $15 an hour, or $105,000.00...
  • Accelerating ML within CNN
    At CNN, our mission is to inform, engage, and empower the world in a way that is trusted, timely, and transparent...Our Data Intelligence team, in particular, leverages data and machine-learning capabilities to build innovative experiences for our audience and provides scalable solutions to CNN’s operations. As the world’s largest digital news destination, we averaged more than 200 million unique global visitors every month of 2020...As an informal estimate, our data science team believes they were able to test twice as many models in Q1 2021 as they did in all of 2020, with simple experiments that would have taken a week now taking half a day...
  • Advancing sports analytics through AI research
    The rapid growth of sports data collection means we are in the midst of a remarkably important era for sports analytics... In our recent paper published in collaboration with Liverpool Football Club (LFC) in JAIR, we envision the future landscape of sports analytics using a combination of statistical learning, video understanding, and game theory. We illustrate football, in particular, is a useful microcosm for studying AI research, offering benefits in the longer-term to decision-makers in sports in the form of an automated video-assistant coach (AVAC) system...
  • Momentum Residual Neural Networks
    The training of deep residual neural networks (ResNets) with backpropagation has a memory cost that increases linearly with respect to the depth of the network. A simple way to circumvent this issue is to use reversible architectures. In this paper, we propose to change the forward rule of a ResNet by adding a momentum term...We show on CIFAR and ImageNet that MomentumNets have the same accuracy as ResNets, while having a much smaller memory footprint, and show that pre-trained MomentumNets are promising for fine-tuning models...
  • Synthetic Data for Model Selection
    Recent improvements in synthetic data generation make it possible to produce images that are highly photorealistic and indistinguishable from real ones. Furthermore, synthetic generation pipelines have the potential to generate an unlimited number of images...In contrast to using synthetic data for training, in this work we explore whether synthetic data can be beneficial for model selection. Considering the task of image classification, we demonstrate that when data is scarce, synthetic data can be used to replace the held out validation set, thus allowing to train on a larger dataset...
  • Game theory as an engine for large-scale data analysis
    We present “EigenGame: PCA as a Nash Equilibrium”...our research explores a new approach to an old problem: we reformulated principal component analysis (PCA), a type of eigenvalue problem, as a competitive multi-agent game we call EigenGame. PCA is typically formulated as an optimisation problem (or single-agent problem); however, we found that the multi-agent perspective allowed us to develop new insights and algorithms which make use of the latest computational resources. This enabled us to scale to massive data sets that previously would have been too computationally demanding, and offers an alternative approach for future exploration...
  • Datasets on arXiv
    We’re excited to announce our partnership with arXiv to support links to datasets on arXiv!...Machine learning articles on arXiv now have a Code & Data tab to link to datasets that are used or introduced in a paper...From Papers with Code you can discover other papers using the same dataset, track usage over time, compare models and find similar datasets...
  • Good Data Scientist, Bad Data Scientist
    There’s a wide array of work a data scientist (DS) can be involved with. This post aims to address the common elements that will make a great DS, or a bad one, no matter what part of the stack you are working on...There’s also a core set of technical chops every DS must have: SQL, an analytical mindset, fluency in a programming language like Python or R, an understanding of statistics & common statistical procedures, and machine learning methods, if the product calls for it. But that is only half the story. If we condition on people who have those required baseline skills, this post is what separates the good from the bad...
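
For readers who want a concrete picture of "PCA as an eigenvalue problem" from the EigenGame item above, here is a minimal sketch. It is not the EigenGame algorithm; it is ordinary power iteration on a covariance matrix, which recovers the top principal component. The dataset and the function names are made up for illustration.

```python
import math

# Tiny made-up 2-D dataset (illustrative only, not from any article above).
data = [(2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0), (6.0, 7.0), (7.0, 8.0)]


def covariance_matrix(rows):
    """Sample covariance matrix of the rows (features are columns)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_principal_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and renormalise the vector."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = matvec(cov, vec)
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Rayleigh quotient gives the eigenvalue (variance along the component).
    eigenvalue = sum(v * w for v, w in zip(vec, matvec(cov, vec)))
    return eigenvalue, vec


cov = covariance_matrix(data)
eigenvalue, component = top_principal_component(cov)
print(eigenvalue, component)
```

The returned vector is the direction of greatest variance in the data, and the eigenvalue is the variance along it. Methods like the one in the article matter when the data is too large for this kind of single-machine computation.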



Similarity Search as a Service

Have you ever felt like Facebook knows you better than your friends, or wondered how Spotify makes such great playlists just for you? Their secret sauce is "similarity search." Last week we shared an introduction to similarity search and some of its use cases. This week we invite you to play around with similarity search and see what you can make with it:

    import pandas as pd
    import pinecone

    pinecone.init(api_key="YOUR_API_KEY")
    pinecone.create_index("hello-pinecone-index", metric="euclidean")
    index = pinecone.Index("hello-pinecone-index")

    # Generate sample data
    df = pd.DataFrame(data={
        "id": ["A", "B", "C", "D", "E"],
        "vector": [[1]*2, [2]*2, [3]*2, [4]*2, [5]*2]
    })

    index.upsert(items=zip(df.id, df.vector))  # Insert the data
    index.query(queries=[[0, 1]], top_k=3)  # Query the index and get similar vectors
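
To see what a query like this computes without an API key, the idea can be tried locally. The sketch below is not Pinecone's implementation; it is a brute-force Euclidean nearest-neighbour search over the same five sample vectors, using only the Python standard library. The helper name `query_top_k` is made up for illustration.

```python
import math

# Same sample data as the snippet above: ids A..E, vectors [1, 1] .. [5, 5].
ids = ["A", "B", "C", "D", "E"]
vectors = [[float(i)] * 2 for i in range(1, 6)]


def query_top_k(query, top_k=3):
    """Return the top_k (id, distance) pairs closest to `query` (hypothetical helper)."""
    scored = [(item_id, math.dist(query, vec)) for item_id, vec in zip(ids, vectors)]
    # Smaller Euclidean distance means more similar.
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


results = query_top_k([0, 1], top_k=3)
print(results)
```

Scanning every vector works for five items. A hosted index exists because this linear scan becomes too slow once there are millions of vectors.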

Get your Pinecone API key and build your first similarity search application today.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



  • Experimental Behavioral Scientist - BetterUp, Inc. - US-based, remote

    BetterUp is a mobile-based coaching platform that brings personalized professional coaching to employees at all levels. We help managers lead better, teams perform better, and employees thrive personally and inspire professionally.

    We are seeking an experimental behavioral scientist to join our team. In this role, you will direct a portfolio of original research to answer an essential question: What makes people happy and flourishing at work?

    You’ll draw on your experience as an experimental social scientist, statistician, and lover of all things Data, to uncover groundbreaking findings at an epicenter of human experience: life at work. Your work will inform BetterUp products, inspire our customers, inform the broader scientific community, and amplify BetterUp’s reputation as a global thought-leader.

        Want to post a job here? Email us for details >>


Training & Resources

  • What is a Vector Database?
    The meteoric rise in Machine Learning in the last few years has led to increasing use of vector embeddings. They are fundamental to many models and approaches, and are a potent tool for applications such as semantic search, similarity search, and anomaly detection...The unique nature, growing volume, and rising importance of vector embeddings make it necessary to find new methods of storage and retrieval. We need a new kind of database...Let’s explore what makes vector databases unique...
  • Neural Network Embeddings Explained
    In this article, I’ll explain what neural network embeddings are, why we want to use them, and how they are learned. We’ll go through these concepts in the context of a real problem I’m working on: representing all the books on Wikipedia as vectors to create a book recommendation system...
  • How to become a Quantum Software Engineer
    Quantum computing has come a long way in the past decade, and the infamous “It’ll be ready in 5–10 years” might actually be true this time. While the hardware teams continue to make compounding progress, there is a growing need for greater tooling and quantum libraries. The term “quantum software engineer” didn’t even exist a decade ago!...If you’re a software engineer looking to get into the field, now is a great time to start learning and developing your skills...




  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian