Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
Issue 140
July 28, 2016

Editor's Picks

  • How I built a Slack bot to help me find an apartment in San Francisco
    I moved from Boston to the Bay Area a few months ago. Priya (my girlfriend) and I heard all sorts of horror stories about the rental market. The fact that searching for “How to find an apartment in San Francisco” on Google yields dozens of pages of advice is a good indicator that apartment hunting is a painful process...
  • Approaching (Almost) Any Machine Learning Problem
    An average data scientist deals with loads of data daily. Some say that 60-70% of the time is spent cleaning and munging data and bringing it into a format to which machine learning models can be applied. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps. The pipelines discussed in this post come as a result of over a hundred machine learning competitions that I’ve taken part in...



A Message From This Week's Sponsor


  • Where science and policy change the world. And You.

    Apply your knowledge & skills to federal policy via the AAAS Science & Technology Policy Fellowships, a year-long professional development opportunity for doctoral-level data scientists to serve in the federal government in Washington, D.C.
    STPF fosters a career-enhancing network of science leaders who understand policymaking & contribute to society...


Data Science Articles & Videos

  • Fly the Frustrating Skies: Sentiment Analysis of the U.S. Airline Industry
    For this project, I compared the performance of various sentiment analyzers in order to identify the most effective strategy for identifying customer complaints directed towards the Twitter accounts of 6 major U.S. airlines (United, American Airlines, U.S. Airways, Southwest, Delta, and Virgin America)...
  • Applying Data Science To The Supreme Court: Topic Modeling Over Time With NMF (and A D3.js Bonus)
    The Supreme Court is arguably the most important branch of government for guiding our future, but it's incredibly difficult for the average American to get a grasp of what's happening. I decided that a good start in closing this gap would be to model topics over time and create an interactive visualization for anyone with an interest and an internet connection to utilize to educate themselves...
  • NYC Subway Math
    I started tracking all subway trains one day and completely forgot about it. Several weeks later I had a 3GB large data dump full of all the arrivals for 1, 2, 3, 4, 5, 6, L, SI and GC (the latter two being Staten Island railway and Grand Central Shuttle). Let’s do some cool stuff with this data!...
  • The Skynet Salesman
    Operations covers a broad range of problems and can involve things like optimizing shipping, allocating items to warehouses, coordinating processes to ensure that our products arrive on time, or optimizing the internal workings of a warehouse. One of the canonical questions in operations is the traveling salesman problem (TSP). In its simplest form, we have a busy salesperson who must visit a set number of locations once...
  • Language modeling a billion words
    In this Torch blog post, we use noise contrastive estimation (NCE) [2] to train a multi-GPU recurrent neural network language model (RNNLM) on the Google billion words (GBW) dataset...
  • Why I’m Not a Fan of R-Squared
    R² answers the question: “does my model perform better than a constant model?” But we often would like to answer a very different question: “does my model perform worse than the true model?”...
  • Degrees Of Separation On A Tree Algorithm
    Imagine that you are like me and looking for an algorithm to compute degrees of separation along a hierarchical tree like the ones pictured below. Your tree could represent any data — let’s pretend it’s a company orgchart with node A as the CEO, nodes B and I are execs, and so on. The distance, or degrees of separation, between any two nodes is the number of links along the shortest path that separates them...
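
    The definition of distance in the last item above can be illustrated with a small sketch. This is not the linked article's algorithm, only one common way to count the links: walk up from both nodes until their paths meet at the lowest common ancestor. The parent-map representation, the function name and every node other than A, B and I are assumptions made for illustration.

```python
def degrees_of_separation(parent, a, b):
    """Count the links on the path between nodes a and b in a rooted tree.

    `parent` maps each node to its parent; the root maps to None.
    (Hypothetical representation, chosen for this sketch.)
    """
    # Record how many links separate `a` from each of its ancestors,
    # counting `a` itself as an ancestor at distance 0.
    steps_from_a = {}
    node, steps = a, 0
    while node is not None:
        steps_from_a[node] = steps
        node = parent.get(node)
        steps += 1

    # Walk up from `b`; the first node that is also an ancestor of `a`
    # is the lowest common ancestor, where the two paths join.
    node, steps = b, 0
    while node is not None:
        if node in steps_from_a:
            return steps_from_a[node] + steps
        node = parent.get(node)
        steps += 1

    raise ValueError("nodes are not in the same tree")


# Hypothetical org chart: A is the CEO, B and I are execs,
# and C, D and J are invented reports added for the example.
parent = {"A": None, "B": "A", "I": "A", "C": "B", "D": "B", "J": "I"}

print(degrees_of_separation(parent, "C", "J"))  # path C-B-A-I-J
```

    Each query costs time proportional to the depth of the tree. Precomputing node depths or using a dedicated lowest-common-ancestor structure would be faster for many queries on a large tree.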



Jobs

  • Applied Data Scientist - Ancestry - Lehi, UT
    AncestryDNA is the world's largest consumer genomics database, providing consumers insights into their ancestral origins. The service enables customers not only to uncover their ethnic mix and rich family stories, but also to discover distant relatives with a common ancestral match and to help solve the toughest family mysteries. The Data Science team is looking for an experienced Data Scientist who has a passion to build data products and data systems...


Training & Resources

  • Generalized linear models, abridged.
    This note grew out of our own desire to better understand the numerics of generalized linear models. We highlight aspects of GLM implementations that we find particularly interesting. We present some reference implementations stripped down to illuminate core ideas; often with just a few lines of code...


