Data Science Weekly Newsletter - Issue 400

Issue #400

July 22 2021

Editor Picks
  • Roadmap: Data Infrastructure (Bessemer Venture Partners Report)
    The modern cloud data stack is undergoing massive construction and the future of software will be defined by the accessibility and use of data...This guide is a starting point for our investments in the data stack as a separate category. There are multiple massive businesses that have already been developed and others waiting to be started by the right founders as every role is empowered through the new, accessible, modern data stack...
  • Why Deep Learning Works Even Though It Shouldn’t
    I find that people from a statistics background tend to throw up their hands at deep learning, because from a traditional statistics perspective, none of it can possibly work. This makes it very frustrating that it does. As a result they tend to have a much more dim view of its results and methods than their continued success warrants, so I hope here that I can bridge some of that gap...The key thing I’m going to try to intuitively explain is why models always get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller...

A Message from this week's Sponsor:


The Vector Database

Pinecone is a fully managed vector database that makes it easy to add vector similarity search to production applications. It combines state-of-the-art vector search libraries, advanced features such as live index updates, and distributed infrastructure to provide high performance and reliability at any scale. No more hassles of benchmarking and tuning algorithms or building and maintaining infrastructure for vector search.

Advanced ML teams use vector search to drastically improve results for semantic text search, image/audio search, recommendation systems, feed ranking, abuse/fraud detection, deduplication, and other applications.

3 reasons to try Pinecone:
  • It's production-ready: Go to production with a few lines of code, without breaking a sweat or slowing down.
  • It's scalable and high-performing: Search through billions of vectors in tens of milliseconds.
  • It's fully managed: We obsess over operations and security so you don't have to.
Try Pinecone now for free →

PS — Get a free t-shirt after you run your first query!

Data Science Articles & Videos

  • Is GitHub Copilot a blessing, or a curse?
    GitHub Copilot is a new service from GitHub and OpenAI, described as “Your AI pair programmer”. It is a plugin to Visual Studio Code which auto-generates code for you based on the contents of the current file, and your current cursor location...It really feels quite magical to use. For example, here I’ve typed the name and docstring of a function which should “Write text to file fname”...The grey body of the function has been entirely written for me by Copilot! I just hit Tab on my keyboard, and the suggestion gets accepted and inserted into my code...
  • GitHub Copilot: First Impressions
    I’ve been lucky enough to use it for the past few weeks, and so far it has proven quite useful, having earned a place in my toolbox despite its rough edges. I also feel it signals a coming change in how we develop and reason about systems, a change which will allow us to go up a few layers of abstraction in the coming decades...
  • OpenAI disbands its robotics research team
    OpenAI has disbanded its robotics team after years of research into machines that can learn to perform tasks like solving a Rubik’s Cube...“It turns out that we can make a gigantic progress whenever we have access to data. And I kept all of our machinery unsupervised, [using] reinforcement learning - [it] work[s] extremely well. There [are] actually plenty of domains that are very, very rich with data. And ultimately that was holding us back in terms of robotics”...
  • Data vs classifiers, who wins?
    The classification experiments covered by machine learning (ML) are composed of two important parts: the data and the algorithm. As they are a fundamental part of the problem, both must be considered when evaluating a model's performance against a benchmark...This work presents a new evaluation methodology based on IRT and Glicko-2, jointly with the decodIRT tool developed to guide the estimation of IRT in ML. It explores the IRT as a tool to evaluate the OpenML-CC18 benchmark for its algorithmic evaluation capability and checks if there is a subset of datasets more efficient than the original benchmark...
  • Machine learning in a hurry: what I've learned from the SLICED ML competition
    This summer I’ve been competing in the SLICED machine learning competition, where contestants have two hours to open a new dataset, build a predictive model, and be scored as a Kaggle submission. Contestants are graded primarily on model performance, but also get points for visualization and storytelling, and from audience votes. Before SLICED I had almost no experience with competitive ML, so I learned a lot!...
  • The Bicycle Network Improvement Problem: Optimization Algorithms and A Case Study in Atlanta
    This paper presents a method to find the best way to improve the safety of a bicycle network for a given budget and maximize the number of riders that could now choose bicycles for their commuting needs. This optimization problem is formalized as the Bicycle Network Improvement Problem (BNIP): it selects which roads to improve for a set of traveler origin-destination pairs, taking both safety and travel distance into account. The BNIP is modeled as a mixed-integer linear program that minimizes a piecewise linear penalty function of route deviations of travelers. The MIP is solved using Benders decomposition to scale to large instances. The paper also presents an in-depth case study for the Midtown area in Atlanta, GA, using actual transportation data...
  • Better computer vision models by combining Transformers and convolutional neural networks
    We’ve developed a new computer vision model called ConViT, which combines two widely used AI architectures — convolutional neural networks (CNNs) and Transformer-based models — in order to overcome some important limitations of each approach on its own. By leveraging both techniques, this vision Transformer-based model can outperform existing architectures, especially in the low data regime, while achieving similar performance in the large data setting...
  • Deep Learning on a Data Diet: Finding Important Examples Early in Training
    The recent success of deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations, can be used to identify a smaller set of training data that is important for generalization...
  • PIX from DeepMind
    PIX is an image processing library in JAX, for JAX...JAX is a library resulting from the union of Autograd and XLA for high-performance machine learning research. It provides NumPy, SciPy, automatic differentiation and first-class GPU/TPU support...PIX is a library built on top of JAX with the goal of providing image processing functions and tools to JAX in a way that they can be optimised and parallelised through jax.jit, jax.vmap and jax.pmap...
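
To make the "Data Diet" item above a little more concrete, here is a rough sketch of the scoring idea from its abstract: compute each training example's loss gradient norm at initialization, average it over several random weight initializations, and keep the highest-scoring examples. This is not the paper's code. As a loud simplification, it swaps the deep network for a plain logistic regression so it runs with the standard library alone, and every function name here is made up for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_norm(weights, bias, x, y):
    """L2 norm of the log-loss gradient w.r.t. (weights, bias) for one example."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    err = p - y  # derivative of the log loss w.r.t. the logit
    # The gradient is err * x for the weights and err for the bias.
    return abs(err) * math.sqrt(sum(xi * xi for xi in x) + 1.0)


def initial_grad_norm_scores(xs, ys, n_inits=10, seed=0):
    """Average each example's gradient norm over several random initializations."""
    rng = random.Random(seed)
    dim = len(xs[0])
    scores = [0.0] * len(xs)
    for _ in range(n_inits):
        # Fresh random weights: the model is scored before any training step.
        weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        bias = rng.gauss(0.0, 1.0)
        for i, (x, y) in enumerate(zip(xs, ys)):
            scores[i] += grad_norm(weights, bias, x, y) / n_inits
    return scores


def keep_top_fraction(scores, fraction):
    """Indices of the highest-scoring examples, returned in dataset order."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): four 2-D points with binary labels.
xs = [[0.0, 0.0], [1.0, 1.0], [3.0, -2.0], [0.5, 0.2]]
ys = [0, 1, 0, 1]
scores = initial_grad_norm_scores(xs, ys)
kept = keep_top_fraction(scores, 0.5)
```

Whether this ranking picks out useful examples for a given dataset is exactly what the paper studies; the sketch only shows the mechanics of the score.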




Online Data Science Programs from Drexel University

Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Senior Data Scientist - WarnerMedia - New York, NY

    WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

    Reporting to the Sr. Manager, Data Science, this role will help to develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation...

        Want to post a job here? Email us for details >>


Training & Resources

  • 30 Days of ML: Machine learning beginner → Kaggle competitor in 30 days. Non-coders welcome.
    In the first 2 weeks, you’ll receive hands-on assignments delivered to your inbox. The goal of these assignments is to rapidly cover the most essential skills needed to get your hands dirty with data. You'll start by learning how to code in Python and quickly learn how to build your first machine learning model...After tackling these core concepts, you’ll be invited to a super fun, beginner-friendly Kaggle machine learning competition to test your knowledge. Through practice, you’ll explore the best ways to use Kaggle as a learning resource and connect with other data scientists...On top of all of this, you’ll have the opportunity to attend elective workshops and seminars hosted by data scientists from Google's Developer Experts Program...
  • Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow
    In some previous blog posts we described in detail how one can generalize automatic differentiation to automatically give stability enhancements and all sorts of other niceties by incorporating graph transformations into code generation. However, one of the things which we didn't go into too much is the limitation of these types of algorithms. This limitation is what we have termed "quasi-static", which is the property that an algorithm can be reinterpreted as some static algorithm. It turns out that for very fundamental reasons, this is the same limitation that some major machine learning frameworks impose on the code that they can fully optimize, such as Jax or Tensorflow. This led us to the question: are there algorithms which are not optimizable within this mindset, and why? The answer is now published at ICML 2021, so let's dig into this higher level concept...
  • Introducing the Data Validation Tool for EDW migrations
    Data validation is a crucial step in data warehouse, database, or data lake migration projects. It involves comparing structured or semi-structured data from the source and target tables and verifying that they match after each migration step (e.g., data and schema migration, SQL script translation, ETL migration, etc.)...Today, we are excited to announce the Data Validation Tool (DVT), an open sourced Python CLI tool that provides an automated and repeatable solution for validation across different environments. The tool uses the Ibis framework to connect to a large number of data sources including BigQuery, Cloud Spanner, Cloud SQL, Teradata, and more...
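
For readers new to the idea in the data validation item above, the core check can be sketched in a few lines: run the same aggregates against the source and target tables and compare the results. The sketch below is not the Data Validation Tool or its CLI. It is a minimal illustration that assumes two in-memory SQLite databases standing in for the source and target, and the table, column, and function names are all hypothetical.

```python
import sqlite3


def column_aggregates(conn, table, numeric_column):
    """Return (row count, sum of numeric_column) for one table."""
    # Identifiers cannot be bound as SQL parameters, so they are interpolated.
    # This sketch assumes the names come from trusted configuration.
    query = f'SELECT COUNT(*), COALESCE(SUM("{numeric_column}"), 0) FROM "{table}"'
    return conn.execute(query).fetchone()


def validate(source_conn, target_conn, table, numeric_column):
    """Compare the same aggregates on both sides of a migration."""
    src_count, src_sum = column_aggregates(source_conn, table, numeric_column)
    tgt_count, tgt_sum = column_aggregates(target_conn, table, numeric_column)
    return {
        "count_matches": src_count == tgt_count,
        "sum_matches": src_sum == tgt_sum,
        "source": (src_count, src_sum),
        "target": (tgt_count, tgt_sum),
    }


def make_db(rows):
    """Build a throwaway in-memory database with one 'orders' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 100), (2, 250), (3, 75)])
target_ok = make_db([(1, 100), (2, 250), (3, 75)])
target_bad = make_db([(1, 100), (2, 250)])  # one row lost in the "migration"

report_ok = validate(source, target_ok, "orders", "amount")
report_bad = validate(source, target_bad, "orders", "amount")
```

Matching counts and sums do not prove the tables are identical, so a real migration would usually add row-level or hash-based comparisons on top of checks like these.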




Books

  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian