Data Science Weekly Newsletter - Issue 425

Issue #392

May 27, 2021

Editor Picks
  • Archaeologists train a neural network to sort pottery fragments for them
    Real archaeological fieldwork is seldom as exciting as it looks in the movies...You tend to get fewer reanimated mummies, deadly booby traps, and dramatic shootouts with Nazis. Instead, you'll see pieces of broken pottery—a lot of them...“An archaeologist experienced in decorated ceramics is often capable of assigning a type to a sherd in a fraction of a second, without consciously thinking of all the design rules for that type,” wrote Pawlowicz and Downum. Their CNN, on the other hand, color-coded specific features on the photos that explained its choices. By combining that ability with the more intuitive work of human archaeologists, future work could help sort out some artifacts that might otherwise go unclassified...
  • JavaScript for Data Analysis
    With the web opening new frontiers in collaboration, the web’s native language of JavaScript is the best choice for exploring data and communicating insights...

A Message from this week's Sponsor:


Ray Summit: Learn about the latest trends in scaling data & ML

Ray, the open source Python framework that simplifies distributed computing, is becoming a key technology in large scale machine learning and Python. Ray Summit brings together developers, data scientists, engineers, and architects to build scalable data & AI applications with Ray. Join and see how companies like Visa, Intel, Uber, Amazon, Ant Financial, and others are using Ray to build and scale distributed applications—with sessions on large scale data processing, petabyte-scale data lake management, scaling interactive data science, massive-scale ML—and many more. Register free to join live or on-demand.


Data Science Articles & Videos

  • “Causal Inference: The Mixtape”
    Now we have another friendly introduction to causal inference by an economist, presented as a readable paperback book with a fun title. I’m speaking of “Causal Inference: The Mixtape,” by Scott Cunningham. I like the book—all the blurbs on the back are correct...
  • Investing in startups with big ideas about AI
    The OpenAI startup fund is investing $100 million to help AI companies have a profound, positive impact on the world. We’re looking to partner with a small number of early-stage startups in fields where artificial intelligence can have a transformative effect—like health care, climate change, and education—and where AI tools can empower people by helping them be more productive...
  • In Search Of: Simpson's Paradox
    Is Simpson’s Paradox just a mathematical curiosity, or does it happen in real life? And if it happens, what does it mean? To answer these questions, I’ve been searching for natural examples in data from the General Social Survey (GSS)...
  • Potemkin Data Science
    The appearance of data science smartness is valuable, but the actual results of a data science effort might not be (or not as immediately visibly valuable)...
  • Pretrained Language Models for Text Generation: A Survey
    Text generation has become one of the most important yet challenging tasks in natural language processing (NLP)...In this paper, we present an overview of the major advances achieved in the topic of PLMs for text generation...Our survey aims to provide text generation researchers a synthesis and pointer to related research. ...
  • How Companies Are Investing in AI Risk and Liability Minimization
    In this episode of the Data Exchange Podcast I speak with Andrew Burt, co-founder and Managing Partner of BNH.ai, a new law firm focused on AI compliance, risk mitigation, and related topics. BNH is the first law firm run by lawyers and technologists focused on helping companies identify and mitigate risks associated with machine learning and AI...
  • How to Extract Data Observability Metrics from Snowflake Using SQL
    When it comes to managing data quality in your Snowflake environment, there are a few steps data teams can take to understand the health of your data from ingest to consumption...Here’s a five-step approach for extracting data observability metrics from Snowflake, and in turn, getting one step closer to trusting your data...
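
For readers curious about the Simpson's Paradox item above, here is a minimal Python sketch of what the paradox looks like in numbers. It does not use the GSS data from the article; it uses the widely cited kidney stone treatment counts as a stand-in, and the names `data`, `rate`, and `overall` are illustrative only.

```python
# (successes, trials) for two treatments, split by stone size.
# These are the commonly quoted kidney stone counts, used here only
# as an illustration; they are not taken from the linked article.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate for one group."""
    return successes / trials


def overall(groups):
    """Pooled success rate across all subgroups of one treatment."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return rate(total_successes, total_trials)


# Within each subgroup, compare the two treatments.
for size in ("small", "large"):
    rate_a = rate(*data["A"][size])
    rate_b = rate(*data["B"][size])
    print(f"{size:>5} stones: A={rate_a:.3f}  B={rate_b:.3f}")

# Pooled over subgroups, the comparison flips direction.
print(f"  overall:     A={overall(data['A']):.3f}  B={overall(data['B']):.3f}")
```

Treatment A has the higher rate in each subgroup but the lower rate once the subgroups are pooled, because A was applied mostly to the harder (large stone) cases.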



Free T-Shirt After Your First Similarity Search

"Love thy nearest neighbor!"

Show the world you're into algorithms and you're a kind human. Get a free t-shirt when you try Pinecone and make your first similarity search query.

Whether you start with "hello world" or a more practical example, you'll see that deploying a similarity search feature is easy with Pinecone. Common use cases include:
  • Semantic search
  • Document search
  • Image/Audio/Video search
  • Anomaly detection
  • Deduplication
  • Question-answering
  • Personalized recommendations
  • Record matching
  • Automatic labeling
  • T-shirt acquisition :)
Try similarity search with Pinecone and get your free t-shirt after your first pinecone.query() call — while supplies last!

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



  • Data Science Roles - Blue Cross and Blue Shield of IL, MT, NM, OK, TX - Chicago, IL / Richardson, TX

    Health Care Service Corporation (HCSC), an Independent Licensee of the Blue Cross and Blue Shield Association, is the largest customer-owned and not-for-profit health insurer and fourth largest health insurer overall in the United States.

    Our Data Science organization touches every aspect of our business, from claims processing to customer service to care management. Likewise, our portfolio employs a broad range of methodologies - from canonical tasks like using medical history to predict disease progression to applying NLP to doctors’ notes or leveraging deep learning on medical imaging. We’re hiring to expand our capabilities across the spectrum of Data Science roles. Come join us for the opportunity to work with massive datasets to drive revenue growth, improve our operational and member-facing processes, and affect how healthcare is delivered to our members.

    Click here to check out our relevant job postings!

        Want to post a job here? Email us for details >>


Training & Resources

  • High-performance speech recognition with no supervision at all
    We developed wav2vec Unsupervised (wav2vec-U), a way to build speech recognition systems that require no transcribed data at all. It rivals the performance of the best supervised models from only a few years ago, which were trained on nearly 1,000 hours of transcribed speech. We’ve tested wav2vec-U with languages such as Swahili and Tatar, which do not currently have high-quality speech recognition models available because they lack extensive collections of labeled training data...
  • What Is Logistic Regression?
    Logistic Regression - It is a little counterintuitive, but Logistic Regression is typically used as a classifier...This tutorial is on the basics of applying logistic regression, using a little bit of Python...
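
To make the "regression used as a classifier" point in the logistic regression item concrete, here is a minimal pure-Python sketch. It is not the code from the linked tutorial: the toy data (hours studied vs. pass/fail), the learning rate, and the function names are all assumptions made for illustration. The model regresses a probability, and the classifier is just a threshold applied to that probability.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit a one-feature logistic model with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(x, w, b):
    """The 'regression' part: a predicted probability of class 1."""
    return sigmoid(w * x + b)


def predict(x, w, b, threshold=0.5):
    """The 'classifier' part: threshold the predicted probability."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical toy data: hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
for x in (1.0, 3.5):
    print(f"hours={x}: p(pass)={predict_proba(x, w, b):.2f}, class={predict(x, w, b)}")
```

Changing `threshold` trades off false positives against false negatives without refitting the model, which is one reason the probability output is useful on its own.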




  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian