Data Science Weekly Newsletter

Issue 402

August 5, 2021

Editor's Picks

Visualizing a codebase
How can we “fingerprint” a codebase to see its structure at a glance? Let’s explore ways to automatically visualize a GitHub repo, and how that could be useful...

Growing open-source - from Torch to PyTorch
Most small open-source projects, after enough effort and involvement, think about growth. At that point, they’ve nailed down their core interests and philosophies which are a foundation for their technical and cultural stack. Next, they’re wondering if they are doing the best they can to sell, market and grow this project...In this note, I talk about four aspects that are useful to deliberately outline as you grow from this stage of a project. I talk through these aspects via stories from my journey from Torch to PyTorch: a) Philosophy / Principles, b) Scope & Risk, c) Measurement, d) Scaling of the project...

How hard is it to get counting right?
In other words, it was familiarity with the data-generating process that enabled the lab group to imagine this potential vulnerability and come up with this experiment. By the time the data gets into the hands of analysts, it’s too late to fix. You can’t math your way out of a wrong number. This mistake was caught only because it was the same people generating the data as analyzing it. Which, great for ecology - but as data science becomes more and more specialized, it will be increasingly done by people who are explicitly and solely data scientists. And they’ll inherit datasets from repositories somewhere and never catch a single one of these systemic errors because they couldn’t sift through the wet mouse droppings even if they wanted to...

A Message From This Week's Sponsor

The Vector Database
Pinecone is a fully managed vector database that makes it easy to add vector similarity search to production applications. It combines state-of-the-art vector search libraries, advanced features such as live index updates, and distributed infrastructure to provide high performance and reliability at any scale. No more hassles of benchmarking and tuning algorithms or building and maintaining infrastructure for vector search. Advanced ML teams use vector search to drastically improve results for semantic text search, image/audio search, recommendation systems, feed ranking, abuse/fraud detection, deduplication, and other applications.

3 reasons to try Pinecone:

It's production-ready: Go to production with a few lines of code, without breaking a sweat or slowing down.
It's scalable and high-performing: Search through billions of vectors in tens of milliseconds.
It's fully managed: We obsess over operations and security so you don't have to.

Try Pinecone now for free →
PS: Get a free t-shirt after you run your first query!

Data Science Articles & Videos

r/Robotics Showcase
We are excited to announce the 1st Reddit Robotics Showcase!...We showcase the multitude of projects underway in the r/Robotics Reddit community. We have nearly 30 presentations from members of the r/robotics community, ranging from hobbyists, professionals, academics, industrial, students, etc. We have four main categories: Simulation, Mobile Robots, Manipulation, and Legged Robots. The showcase is free and online, and is held on July 31st and August 1st. It will be livestreamed via the Reddit Robotics Showcase YouTube channel; you can find the video streams here...

How to build an AI unicorn in 6 years
Today, Tractable is worth $1 billion. Our AI is used by millions of people across America, Asia and Europe to recover faster from road accidents. It helps recycle as many cars as Tesla put on the road in 2019. And yet 6 years ago, Tractable was just me and Raz, two college grads coding in a London basement. A year before that I knew nothing about tech. If it’s happened to me, it can happen to others, so here’s the story & learnings...

Vision Transformer with Progressive Sampling
Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image classification, by simply splitting images into tokens with a fixed length, and employing transformers to learn relations between these tokens. However, such naive tokenization could destruct object structures, assign grids to uninterested regions such as background, and introduce interference signals. To mitigate the above issues, in this paper, we propose an iterative and progressive sampling strategy to locate discriminative regions...

Everyone in Your Organization Needs to Understand AI Ethics
When most organizations think about AI ethics, they often overlook some of the sources of greatest risk: procurement officers, senior leaders who lack the expertise to vet ethical risk in AI projects, and data scientists and engineers who don’t understand the ethical risks of AI. Fixing this requires both awareness and buy-in on your AI ethics program across the organization. To achieve this, consider these six strategies: 1) remove the fear of not getting it right away, 2) tailor your message to your audience, 3) tie your efforts to your company purpose, 4) define what ethics means in an operational way, 5) lean on trusted and influential individuals, and 6) never stop educating...

How I failed machine learning in medical imaging - shortcomings and recommendations
Medical imaging is an important research field with many opportunities for improving patients’ health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we reviewed several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally we provide a broad range of recommendations on how to further address these problems in the future...

Introducing droidlet, a one-stop shop for modularly building intelligent agents
To help researchers and even hobbyists to build more intelligent real-world robots, we’ve created and have open-sourced the droidlet platform...Droidlet is a modular, heterogeneous embodied agent architecture, and a platform for building embodied agents, that sits at the intersection of natural language processing, computer vision, and robotics. It simplifies integrating a wide range of state-of-the-art machine learning (ML) algorithms in embodied systems and robotics to facilitate rapid prototyping...

How PostgreSQL aggregation works and how it inspired our hyperfunctions’ design
Get a primer on PostgreSQL aggregation, how PostgreSQL’s implementation inspired us as we built TimescaleDB hyperfunctions and its integrations with advanced TimescaleDB features – and what this means for developers...

I Know a Place: Beauty and solace in the abandoned worlds of Roblox
Despite the incredible popularity of the bigger games, the majority of Roblox places sit empty. The service has been running since 2006 and has accumulated almost 7 billion places, although far fewer of these places become public-facing “experiences.”...Even now, being online is to watch history accumulate in real time. Every timeline suggests a timeline of yesterday, or of the day before yesterday, all stacked in layers and stored neatly in a server farm somewhere until that particular service collapses under the weight, subjecting internet historians to a panicked flurry of attempted archival activity before all records disappear into screenshots and fuzzy memories...

The Data Exchange Podcast: Sean Taylor on how data science and the role of data scientists have changed over the years
This week our managing editor Jenn Webb and I speak with Sean Taylor, Data Science Manager at Lyft. Sean was previously a research scientist and manager at Facebook where he was instrumental in the creation and release of Prophet, a very popular open source library for time-series forecasting...

The one data platform to rule them all… but according to whom?
After recently starting as an investor at Founders Fund and seeing so many “data platforms,” I decided to do a deeper dive into where each excelled, as well as how they interacted with each other. I think the clearest way to view the ecosystem is through what user persona each tool is implicitly targeting by making that persona’s day 10x better. Perhaps the tool sells into another persona, but I’m most curious about the tool’s true “champion”, the people who can’t imagine their life without it...

Tools

Retool is the fastest way to build internal tools. As developers, we realized that all internal tools are made up of the same building blocks: tables, drop-downs, buttons, text inputs, etc. So, we built a drag-and-drop interface that makes it super easy to build internal tool UIs. All with prebuilt database connectors, and the ability to customize every aspect of code with JavaScript. Companies like DoorDash, Amazon and Brex use Retool to build internal tools super fast. Don't waste hours searching for React components and wrangling data sources and APIs! Try Retool instead.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Senior Data Analyst - HER - Remote

We are looking for a Senior Data Analyst to help us re-develop our existing data workflow, enable better scalability, and improve accuracy. In addition to this, we’re looking for someone to help improve our ability to discover the relevant information in our data, driving our decisions in delivering an ever improving service.

The primary focus of the role will be in establishing a new data gathering pipeline, doing statistical analysis, and helping build the analytical basis for the prediction systems. This is the perfect opportunity to be intricately involved in running analytical experiments in a methodical manner, and give us a hand in improving the next generation of recommendation systems that power our social experience.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Experimenting with CLIP+VQGAN to Create AI Generated Art
I used a Google Colab notebook that makes it really easy to experiment with CLIP+VQGAN in a visual way. You just update the tunable parameters in the UI, hit "run", and away it goes generating images and progressively steering the outputs towards your target prompts...I experimented with our Raccoon Driving a Tractor prompt from paint.wtf...

An attempt at demystifying graph deep learning
There are a ton of great explainers of what graph neural networks are. However, I find that a lot of them go pretty deep into the math pretty quickly. Yet, we still are faced with that age-old problem: where are all the pics?? As such, just as I had attempted with Bayesian deep learning, I'd like to try to demystify graph deep learning as well, using every tool I have at my disposal to minimize the number of equations and maximize intuition using pictures. Here's my attempt, I hope you find it useful!...

One weird trick to shrink convolutional networks for TinyML
The summary is that if you have MaxPool or AveragePool after a convolutional layer in a network, and you’re targeting a resource-constrained system like a microcontroller, you should try removing them entirely and replacing them with a stride in the convolution instead. This has two main benefits, but to explain it’s easiest to diagram out the network before and after...
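
To make the trade concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not taken from the linked post: the layer dimensions (32x32 input, 8 to 16 channels, 3x3 kernel, padding 1) and the helper names `conv_output_size` and `conv_macs` are made up for illustration. It only compares output shapes and multiply-accumulate counts. A strided convolution is not numerically equivalent to convolution plus pooling, so the network would presumably need retraining after the swap.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pool window (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(in_size, in_ch, out_ch, kernel, stride=1, padding=0):
    """Multiply-accumulate count for one square 2-D convolution layer."""
    out = conv_output_size(in_size, kernel, stride, padding)
    return out * out * out_ch * in_ch * kernel * kernel


# Hypothetical layer, chosen only for illustration.
IN_SIZE, IN_CH, OUT_CH, KERNEL, PAD = 32, 8, 16, 3, 1

# Before: stride-1 convolution followed by a 2x2 pool with stride 2.
conv_out = conv_output_size(IN_SIZE, KERNEL, stride=1, padding=PAD)
pooled_out = conv_output_size(conv_out, kernel=2, stride=2)
macs_before = conv_macs(IN_SIZE, IN_CH, OUT_CH, KERNEL, stride=1, padding=PAD)
buffer_before = conv_out * conv_out * OUT_CH  # activations handed to the pool

# After: pool removed, the convolution itself uses stride 2.
strided_out = conv_output_size(IN_SIZE, KERNEL, stride=2, padding=PAD)
macs_after = conv_macs(IN_SIZE, IN_CH, OUT_CH, KERNEL, stride=2, padding=PAD)
buffer_after = strided_out * strided_out * OUT_CH  # activations of the layer

print("output size before/after:", pooled_out, strided_out)
print("MACs before/after:", macs_before, macs_after)
print("activation buffer before/after:", buffer_before, buffer_after)
```

With these made-up dimensions the two variants produce the same spatial output size, while the strided version computes a quarter of the multiply-accumulates and never materializes the full-resolution activation map that the pooling layer would otherwise consume.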

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
