Using AWK and R to parse 25tb
Recently I was put in charge of setting up a workflow for dealing with a large amount of raw DNA sequencing (well, technically SNP chip) data for my lab. The goal was to be able to quickly get data for a given genetic location (called a SNP) for use in modeling, etc. Using vanilla R and AWK, I was able to clean up and organize the data in a natural way, massively speeding up the querying. It certainly wasn’t easy, and it took lots of iterations. This post is meant to help others avoid some of the same mistakes and show what did eventually work...
Evolving Alien Corals
A research project simulating the evolution of virtual corals. Corals are grown in underwater environments containing light and current flow and are evolved with a genetic algorithm. Morphogens, signaling, memory and other biologically motivated capacities enable a multipurpose biomimetic form optimization engine. This work is part of a series of projects exploring emergent and generative forms...
A Message From This Week's Sponsor
Find A Data Science Job Through Vettery
Vettery specializes in tech roles and is completely free for job seekers. Interested? Submit your profile, and if accepted onto the platform, you can receive interview requests directly from top companies growing their data science teams.
Data Science Articles & Videos
A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0...
You don't need Kafka (really).
For those fortunate enough to never have worked with big data, Kafka is a very complex piece of distributed software that coordinates data transfer between multiple computers. More specifically, it “flattens” data so that it can be moved quickly from one place to another. You usually need Kafka when you have a LOT of data that you need to process very quickly and send somewhere else...
Deep Set Prediction Networks
We study the problem of predicting a set from a feature vector with a deep neural network. Existing approaches ignore the set structure of the problem and suffer from discontinuity issues as a result. We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. With a single feature vector as input, we show that our model is able to auto-encode point sets, predict bounding boxes of the set of objects in an image, and predict the attributes of these objects in an image....
Learning Data Augmentation Strategies for Object Detection
We first demonstrate that data augmentation operations borrowed from image classification may be helpful for training detection models, but the improvement is limited. Thus, we investigate how learned, specialized data augmentation policies improve generalization performance for detection models. Importantly, these augmentation policies only affect training and leave a trained model unchanged during evaluation...
Innovations in Graph Representation Learning
Here we present the results of two recent papers on graph embedding: The first paper introduces a novel technique to learn multiple embeddings per node, enabling a better characterization of networks with overlapping communities. The second addresses the fundamental problem of hyperparameter tuning in graph embeddings, allowing one to deploy graph embedding methods with less effort. We are also happy to announce that we have released the code for both papers in the Google Research GitHub repository for graph embeddings...
Finding Success on Twitch
In this blog post, we’re going to walk through how we built a streamer recommender for Twitch and the various tools we used to make the resulting app, available now on Heroku...
Scaling a massive State-of-the-Art Deep Learning model in production
Last week, at Hugging Face, we launched a new groundbreaking text editor app. It’s different from traditional text editors in that an NLP model can complete your sentences if you ask it to, bringing a new dimension to “writing with a machine”. It’s based on GPT-2, OpenAI’s language model that can generate syntactically accurate sentences and coherent paragraphs of text. Here we show the approach we took to scale these models and serve the 10,000 unique users, who wrote the equivalent of more than a hundred books, in the first few days. We explain the thinking that went into it, define the best-fitting architecture for optimal processing, and discuss what we could have improved on...
Data Scientist - Visiting Nurse Service of New York - New York
The Visiting Nurse Service of New York (VNSNY) is the nation’s largest not-for-profit home- and community-based health care organization, serving the five boroughs of New York City, and Nassau, Suffolk, and Westchester Counties. For 125 years, VNSNY has been committed to meeting the health care needs of New Yorkers with compassionate, high-quality home health care. We offer a wide range of services, programs, and health plans to meet the diverse needs of our patients, members, and clients from before birth to the end of life.
The Data Science Team provides advanced analytical support across VNSNY’s family of corporations. We leverage big data in a fast-paced environment to support strategic decisions for the agency. Meaningful, appropriate use of data is central to the success of our organization. We are looking for an ambitious data scientist to join our expanding team...
Training & Resources
Get The PyTorch Variable Shape
Learn how to get the PyTorch Variable shape by using the PyTorch size operation, via a screencast video and full tutorial transcript...
the transformer … “explained”?
Okay, here’s my promised post on the Transformer architecture. The Transformer architecture is the hot new thing in machine learning, especially in NLP...
Guesstimation: Solving the World's Problems on the Back of a Cocktail Napkin
"Guesstimation enables anyone with basic math and science skills to estimate virtually anything--quickly--using plausible assumptions and elementary arithmetic"...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.
P.S. Want to reach our audience and fellow readers? Consider sponsoring: grab a spot now; first come, first served! All the best, Hannah & Sebastian