Receive the Data Science Weekly Newsletter every Thursday

Easy to unsubscribe at any time. Your e-mail address is safe.

Data Science Weekly Newsletter
Issue 38
August 14, 2014

Editor's Picks

  • So You Wanna Try Deep Learning?
    I’m keeping this post quick and dirty, but at least it’s out there. The gist of this post is that I put out a one-file gist that does all the basics, so that you can play around with it yourself...
  • Scholar Octopus
    Fun hack: I took 7200 papers from 34 CV/ML conferences, and laid them out with t-SNE based on bigram tfidf. Explore...
  • Is HBase’s slow and steady approach winning the NoSQL race?
    In the world of NoSQL databases, the products that have dominated the conversation are MongoDB and DataStax Enterprise, a leading distribution of Apache Cassandra. But a couple of headlines this week bring into focus a perhaps less-splashy, though rather tenacious player: Apache HBase, which is included with most major Hadoop distributions...
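
For readers unfamiliar with the "bigram tfidf" step mentioned in the Scholar Octopus item, here is a minimal sketch in plain Python. It assumes the simplest weighting (term frequency times log(N / document frequency)), which may differ from what the author actually used. The titles are made up for illustration, and the t-SNE layout step is not shown.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the list of adjacent word pairs in a lowercased text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def bigram_tfidf(docs):
    """Compute one {bigram: tf-idf weight} dict per document."""
    counts = [Counter(bigrams(d)) for d in docs]
    # Document frequency: in how many documents each bigram appears.
    df = Counter(bg for c in counts for bg in c)
    n = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero for one-word docs
        vectors.append(
            {bg: (k / total) * math.log(n / df[bg]) for bg, k in c.items()}
        )
    return vectors


# Hypothetical paper titles, invented for this example.
titles = [
    "deep learning for object detection",
    "deep learning for speech recognition",
    "convex optimization for object detection",
]
vectors = bigram_tfidf(titles)
```

Bigrams that appear in only one title, such as ("for", "speech"), get a higher weight than bigrams shared across titles, such as ("deep", "learning"). That contrast is what lets a layout method separate documents by topic.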



Data Science Articles & Videos

  • Building a Production Machine Learning Infrastructure
    Josh Wills, Director of Data Science at Cloudera, has a gift for making fairly complicated technology explanations very digestible to the novice and intermediate techie. What I most love about this video is how clearly Josh explains the problem of taking machine learning analytics developed on a large set of data records (see: individuals) and making them work in a production environment on one individual (think eCommerce)...
  • Using scikit-learn Pipelines and FeatureUnions
    Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions...
  • An Empirical Analysis of Stop-and-Frisk in New York City
    Between 2006 and 2012, the New York City Police Department made roughly four million stops as part of the city’s controversial stop-and-frisk program. We empirically study two aspects of the program by analyzing a large public dataset released by the police department that records all documented stops in the city...
  • Interfaces, Efficiency and Big Data
    The recording of John Chambers' keynote presentation from the useR! 2014 conference, Interfaces, Efficiency and Big Data, is now available for viewing thanks to Data Science LA...
  • The Top 5 Questions A Data Scientist Should Ask During a Job Interview
    The data science job market is hot and an incredible number of companies, large and small, are advertising a desperate need for talent. Before jumping on the first 6-figure offer you get, it would be wise to ask the penetrating questions below to make sure that the seemingly golden opportunity in front of you isn’t actually pyrite...
  • The Question to Ask Before Hiring a Data Scientist
    When hiring data scientists, there’s nothing more frustrating than making the wrong hire. Data scientists are in notoriously high demand, hard to attract, and command large salaries, all of which compounds the cost of a mistake...
  • Visualizing product relationships in a Market Basket analysis
    I came up with this technique to visualize and explain market basket analysis in a very simple visualization. This was the core thought behind the technique: algorithms used in text mining can be leveraged to create relationship plots in a market basket analysis...
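
As a rough illustration of the text-mining idea in the market basket item, each basket can be treated like a document and each product like a term, with co-occurrence counts giving the strength of each product relationship. The sketch below is an assumption about the general approach, not the author's method. The baskets and the function name `pair_support` are hypothetical, and the plotting step is not shown.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in pair_counts.items()}


# Hypothetical baskets, invented for this example.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
support = pair_support(baskets)
```

Each key in `support` is one edge of a relationship plot, and its value (the share of baskets containing both products) could set the edge weight.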



Jobs

  • Data Scientist - zulily - Seattle
    zulily is seeking an intellectually curious, collaborative data expert to work as an acquisition-focused data scientist and statistician. As a zulily Data Scientist, you will use statistical analysis and machine learning to better understand how users engage with zulily, and you will use that information to build models that inform our retention and acquisition practices, power recommender systems, and optimize content. You should have a strong background in statistics and probability, machine learning, and working with large datasets. Additionally, you should have knowledge of and experience in online marketing practices and metrics...


Training & Resources

  • Data Science at the Command Line - Webcast
    We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data. The command line, although invented decades ago, is an amazing environment for performing such data science tasks...
  • Seaborn
    Statistical data visualization using matplotlib...


Books


  • Data Analysis Using Regression and Multilevel/Hierarchical Models
    Comprehensive manual in accessible style...
    "Andrew Gelman is a top researcher in Bayesian statistics as well as an excellent writer. He has written an excellent text on Bayesian data analysis that uses the Markov Chain Monte Carlo methods for dealing with hierarchical linear models. This book starts out on an introductory level covering a wide variety of statistical modeling problems including logistic regression and multilevel logistic regression, generalized linear models and causal inference..."...

