Hi Trey, firstly thank you for the interview. Let's start with your background.
Q - What is your 30 second bio?
A - As you mentioned, I'm a Data Scientist at zulily, a site offering daily deals for moms, babies, and kids. I've spent most of my adult life as a quantitative and computational social scientist. I'm also a huge sports fan and really want to advance the state of sports analytics and statistics.
Q - How did you get interested in working with data?
A - It's hard to say, though I was hooked after my first statistics class as an undergrad, and I've always loved computers and hacking. My first computer was a Commodore VIC-20.
Q - Was there a specific "aha" moment when you realized the power of data?
A - Unsupervised learning techniques have always seemed kind of magical to me, whether a simple clustering algorithm or more complicated methods like latent Dirichlet allocation. The idea that you can discover structure in a pile of data without telling the algorithm what you're looking for is pretty amazing. I think once I started working with more unstructured text I realized there was a whole other level of power available here.
Trey, very interesting background. Thank you for sharing. Next, let's talk more about Football Analytics and what you are working on with the spread.
Q - Why are you excited about bringing Data Science and Football together?
A - A large part of Football Analytics is conducted by self-taught hobbyists - which is amazing and really makes the community lively and passionate. The downside is that I see a lot of wheel-reinventing and a lot of ad hoc, arbitrary decisions in the work. It's not uncommon to see things like "I've included all players who started more than 10 games, have more than 4 years in the league, didn't miss time due to injury, and stayed with one team the entire time." This often amounts to selecting on the dependent variable and biases your results. I think a lot of people don't realize that many of the problems in sports analytics are just specific substantive examples of commonly occurring modeling problems. I'm hoping to change this.
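To make the point about selecting on the dependent variable concrete, here is a minimal simulation sketch. It is not taken from Trey's work: the "skill" and "success" variables, the sample size, and the cutoff are all hypothetical, chosen only to show how filtering on the outcome can distort an estimated relationship.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

rng = random.Random(42)

# Hypothetical players: "skill" drives "success" with a true slope of 1.0.
skill = [rng.gauss(0, 1) for _ in range(5000)]
success = [x + rng.gauss(0, 1) for x in skill]

slope_full = ols_slope(skill, success)

# Keep only players whose outcome cleared a bar (for example, long-time
# starters). This filters on the dependent variable itself.
kept = [(x, y) for x, y in zip(skill, success) if y > 1.0]
slope_selected = ols_slope([x for x, _ in kept], [y for _, y in kept])

print(f"all players:   n={len(skill)}, slope={slope_full:.2f}")
print(f"selected on y: n={len(kept)}, slope={slope_selected:.2f}")
```

In this setup the slope estimated on the filtered sample comes out well below the slope estimated on everyone, even though the underlying relationship is the same for all simulated players.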
Q - What are the biggest areas of opportunity / questions you want to tackle?
A - A big, low-hanging fruit that I see is the explicit incorporation of uncertainty into estimates of things like win probabilities. Data Scientists encounter this problem all the time - we need to provide decision-makers with succinct, often single-number summaries that can be used to take action. But we also want to express how confident we are about those summaries and estimates.
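One simple way to attach uncertainty to a single-number summary is a bootstrap interval. The sketch below is an illustration only, not the method used at the spread: the 14-wins-in-20 record is made up, and `bootstrap_interval` is a hypothetical helper written for this example.

```python
import random

def bootstrap_interval(outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the games with replacement and record each resample's win rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = rates[int(tail * n_resamples)]
    upper = rates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical record: 1 = win, 0 = loss, in 20 comparable game situations.
outcomes = [1] * 14 + [0] * 6

estimate = sum(outcomes) / len(outcomes)
low, high = bootstrap_interval(outcomes)

print(f"win probability estimate: {estimate:.2f}")
print(f"95% bootstrap interval:   {low:.2f} to {high:.2f}")
```

The decision-maker still gets one headline number, but the interval shows how much that number could move given only 20 observations.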
Q - What project(s) are you working on at the moment?
A - Right now I'm working on two projects: one using ensemble models (random forests, gradient boosted classifiers, etc.) to build a win probability model, and another building a Bayesian model of so-called 'field goal range' that gives us better estimates of kicking success.
Q - Tell us a little more about the spread - what are your goals for the site?
A - First and foremost, I want it to be fun for myself and for readers -- it's a hobby. Besides that, my goals are a) to improve the state of Football Analytics by offering a different perspective on some commonly explored questions and b) to teach people some basic data science methods. Sports provide lots of great teaching cases for explaining the reasoning behind some common modeling problems. So, I hope it's educational and causes people to think.
Very interesting - look forward to learning from both those projects - and having some fun along the way! Let's talk briefly about your work at zulily and helpful resources...
Q - What does a typical day at zulily look like for you?
A - We're a fast-paced organization and my role covers a lot of different areas. I get the opportunity to work on a diverse set of ever-changing projects. A given day could range from tackling more traditional business statistical problems to sketching out the math behind an algorithm with the engineers to teaching seminars on statistics to non-experts in the company.
Q - What publications, websites, blogs, conferences and/or books do you read/attend that are helpful to your work?
A - On the data science side, the value of the connections I've made via Twitter really can't be overstated. I've made professional connections, personal friends, and have an always-on network of frighteningly smart people who are always willing to help answer a question. I'd say that John Myles White and Drew Conway deserve special mention here. When I started getting to know them, they were both grad students in the social and behavioral sciences like myself. Their book, Machine Learning for Hackers, explains a lot of complicated topics in machine learning while being fun and conversational.
Interesting that Twitter has proven such a valuable connector - good to keep in mind! Finally, let's talk about the future and where you think your field is headed...
Q - What does the future of the spread and/or Football Analytics look like?
A - This is a great question. I don't know, but I hope it's a more transparent, peer-reviewed future with lots of collaboration. I'm a firm believer that we all improve when we make our methods transparent and open to critique. That being said, sports is a business with extremely high stakes and there's a tension there. I think that as analyses become more complicated, the role of data visualization will become much more important in conveying lots of information in an easy-to-understand fashion. Have you seen the laminated play sheets that coaches have on the sidelines? They're not nicknamed "Denny's menus" for nothing.
Q - That's funny! … Finally, how about any words of wisdom for Data Science students or practitioners starting out?
A - I'd say to pick a data set or sets you know really well and explore them like crazy. It's really helpful to be able to apply a new method to a dataset and have the ability to assess the face validity of your findings. It's fun to get counter-intuitive findings, but you should really stop and check your work if somehow you find that Ryan Leaf is actually a better quarterback than Peyton Manning. Examples that use uninteresting data (iris anyone?) are a lot less likely to result in you going the extra mile to learn more and explore after the lesson is over.
I'd also say not to get too discouraged. This stuff is hard and it takes a lot of practice and a lot of willingness to make mistakes and be wrong before you get it right. And, if I had one single piece of advice -- take more matrix algebra.
Trey - Thank you so much for your time! Really enjoyed learning more about the convergence of Data Science and Football and what you are building at the spread. You can find the spread online at http://thespread.us and Trey Causey at @treycausey.
Readers, thanks for joining us!
P.S. If you enjoyed this interview and want to learn more about
- what it takes to become a data scientist
- what skills do I need
- what type of work is currently being done in the field