We recently caught up with Pete Warden, Co-Founder and CTO of Jetpac, which is using Big Data and Object Recognition to build a modern day Yelp...
Hi Pete, firstly thank you for the interview. Let's start with your background and the work going on in Object Recognition right now...
Q - What is your 30 second bio?
A - As you mentioned, I'm the CTO of Jetpac. I was born in Britain and am now living in San Francisco. I used to work for Apple, I've written some books on data for O'Reilly, and I blog at petewarden.com.
Q - What are the main "types" of problems being tackled in the Object Recognition space? Who are the big thought leaders?
A - There's an amazing amount of great research out there around recognizing objects in images, but there have been surprisingly few commercial applications. The biggest successes have been specialized facial recognition security applications, bar-code scanners like Occipital's Red Laser, and Google's image search. Everybody knows object recognition is a crucial foundational technology for the future, but because it's currently so unreliable it's been hard to build any consumer applications around it.
The problem is that object recognition is incredibly hard, and even the best algorithms make a lot of mistakes. If you're doing a search application, these mistakes mean a lot of bogus images showing up in the search results. The fortunate thing about Jetpac is that we have hundreds or thousands of photos of each place we feature, so we're able to derive data from applying our algorithms to all these samples. An algorithm that only spots a mustache 25% of the time would give a terrible experience if you were relying on it to deliver search results, but applying it to a lot of photos taken at the same place gives you a reliable estimate of how many mustaches are present. Even if individual photos are mis-identified, the errors cancel out.
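Editor Note - The "errors cancel out" idea can be illustrated with a small simulation. This is a minimal sketch, not Jetpac's actual method: the function names, the assumption that the detector's recall is known from a labeled set, and the default of zero false positives are all ours.

```python
import random


def estimate_true_count(flags, recall, false_positive_rate=0.0):
    """Estimate how many photos truly contain the object from noisy per-photo flags.

    Assumes (hypothetically) that recall and false-positive rate are known.
    Expected flagged = recall * true + fpr * (n - true); solve for true.
    """
    n = len(flags)
    flagged = sum(flags)
    return (flagged - false_positive_rate * n) / (recall - false_positive_rate)


def simulate_place(n_photos, true_rate, recall, rng):
    """Simulate one place: ground truth per photo, plus a detector that
    only fires on a fraction `recall` of the true positives."""
    truth = [rng.random() < true_rate for _ in range(n_photos)]
    flags = [t and rng.random() < recall for t in truth]
    return truth, flags


if __name__ == "__main__":
    rng = random.Random(0)
    for n_photos in (10, 100, 1000, 10000):
        truth, flags = simulate_place(n_photos, true_rate=0.3, recall=0.25, rng=rng)
        estimate = estimate_true_count(flags, recall=0.25)
        print(n_photos, sum(truth), round(estimate, 1))
```

With only a handful of photos the estimate can be far off, but the relative error shrinks as the number of photos per place grows, which is the property the answer above relies on.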
Q - What are the biggest areas of opportunity / questions you want to tackle?
A - Photos are data! That's the most exciting thing about what we're doing. Once you're able to extract useful information about a place from a collection of photos taken there, all those billions of photos gathering digital dust on hard drives around the world turn into an incredible source of data. We'll be able to answer questions about pollution by analyzing the intensity of sunsets, spot smog in photos, and build a much better picture of how people move around neighborhoods to help plan urban regeneration. There's an endless number of pressing problems this data can help with.
Q - What Data Science methods have you found most helpful?
A - My friend Monica Rogati likes to say that division is her favorite algorithm. I specialize in uncovering new information from discarded sources, mining neglected data exhaust, so most of the work I do is the initial extraction of useful features from apparently useless noise. Once I have the data, most of the analysis is fairly primitive database joins, sums, and division. We use machine learning, neural networks, and a lot of other fancy approaches to analyze the images, but Excel formulas are key too. A lot of people underestimate the usefulness of old-school data tools like spreadsheets.
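Editor Note - As a rough illustration of how far "joins, sums, and division" can go, here is a minimal sketch that joins photo records to detector output and divides to get a per-place share. The data layout, the function name, and the sample values are hypothetical and not taken from Jetpac.

```python
from collections import defaultdict


def flagged_share_by_place(photos, flagged_photo_ids):
    """Return the share of photos at each place that the detector flagged.

    photos: iterable of (photo_id, place_id) pairs (hypothetical layout).
    flagged_photo_ids: set of photo ids the detector fired on.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for photo_id, place_id in photos:
        # The "join": look each photo up in the detector's output.
        totals[place_id] += 1
        # The "sum": count flagged photos per place (True counts as 1).
        flagged[place_id] += photo_id in flagged_photo_ids
    # The "division": flagged photos over all photos, per place.
    return {place: flagged[place] / totals[place] for place in totals}


if __name__ == "__main__":
    photos = [(1, "cafe"), (2, "cafe"), (3, "bar"), (4, "bar"), (5, "bar"), (6, "bar")]
    print(flagged_share_by_place(photos, {1, 4}))
```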
Q - What are your favorite tools / applications to work with?
A - I have to give a plug to the Data Science Toolkit here. It's a custom virtual machine, available as a Vagrant box and an Amazon EC2 image, and it comes pre-installed with my favorite open source tools and data sets. It's focused on taking messy, unstructured data and turning it into something useful, so it has everything from geocoders, sentiment analysis, and document conversion, to entity extraction from text. There are a lot of amazing open-source tools out there, but they're often hard to install and interface with, so I wanted to make my personal favorites available in a turn-key way.
Pete, very interesting background and context - thank you for sharing! Next, let's talk more about what you are working on at Jetpac.
Q - How did you come to found Jetpac?
A - My co-founder Julian was using some of my open-source tools, and he was peppering me with questions. As soon as I talked with him, I realized how fantastic a source of data he was looking at in the hundreds of billions of social photos we're sharing.
Q - What specific problem is Jetpac trying to solve? How would you describe it to someone who is not familiar with it?
A - We help you discover fun places to go, both locally and when you're traveling. We aim to offer the kind of insights you'd get from a knowledgeable local friend about the best bars, hotels, restaurants. The information we get from the mass of pictures, and the pictures we present in the guide, combine to give you a much better idea of what a place is like than any review-based service.
Editor Note - If you are interested in more detail behind how Jetpac's technology works, Pete's recent blog article is very insightful. Here are a few highlights:
Image-based measurements - The most important information we pull out is from the image pixels. These tell us a lot about the places and people who are in the photos, especially since we have hundreds or thousands of pictures for most locations.
One very important difference between what we're doing with Big Data and traditional computer vision applications is that we can tolerate a lot more noise in our recognition tests. We're trying to analyze the properties of one object (a bar for example) based on hundreds of pictures taken there. That means we can afford to have some errors in whether we think an individual photo is a match, as long as the errors are random enough to cancel themselves out over that sort of sample size.
Testing - Internally, we use a library of several thousand images that we've manually labeled with the attributes we care about as a development set to help us build our algorithms, and then a different set of a thousand or so to validate our results. All of the numbers are based on that training set, and I've included grids of one hundred random images to demonstrate the results visually.
We're interested in how well our algorithms correlate with the underlying property they're trying to measure, so we've been using the Matthews Correlation Coefficient (MCC) to evaluate how well they're performing. I considered using precision and recall, but these ignore all the negative results that are correctly rejected, which is the right approach for evaluating search results you're presenting to users, but isn't as useful as a correlation measurement for a binary classifier.
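Editor Note - To make the difference concrete, the sketch below computes precision, recall, and MCC from the four cells of a confusion matrix using the standard formulas. The counts are invented for illustration and are not Jetpac's data; the function name is ours.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Return (precision, recall, mcc) for a binary classifier.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


if __name__ == "__main__":
    # Same positives and mistakes, different numbers of correctly rejected photos.
    many_negatives = confusion_metrics(tp=15, fp=5, fn=85, tn=895)
    few_negatives = confusion_metrics(tp=15, fp=5, fn=85, tn=95)
    print(many_negatives)  # precision 0.75, recall 0.15, MCC about 0.31
    print(few_negatives)   # precision 0.75, recall 0.15, MCC about 0.17
```

Precision and recall are identical in both cases because neither uses the true-negative count, while MCC changes, which is why it behaves more like a correlation measure.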
Example: Pictures of Plates = Foodies - We run an algorithm that looks for plates or cups taking up most of the photo. It's fairly picky, with a precision of 0.78, but a recall of just 0.15, and an MCC of 0.32. If a lot of people are taking photos of their meals or coffee, we assume that there's something remarkable about what's being served, and that it's popular with foodies.
Editor Note - Back to the interview!...
Q - What publications, websites, blogs, conferences and/or books are helpful to your work?
A - O'Reilly have been true pioneers in the data world. I recommend following their blog at http://radar.oreilly.com, and the Strata conference has always been a blast.
Very interesting - look forward to following Jetpac's progress! Finally, it is advice time!...
Q - Any words of wisdom for Data Science students or practitioners starting out?
A - Don't listen to old farts like me. Figure out how we're all doing it wrong, and show us! I'm looking forward to being rendered obsolete by a whole new generation with tools and insights that leave us in the dust. We really have only scratched the surface in what we can do with all the data we're generating, so be ambitious and attack problems everyone else is ignoring as too hard.
Pete - Thank you so much for your time! Really enjoyed learning more about Object Recognition and what you are building at Jetpac.
Jetpac can be found online at https://www.jetpac.com and Pete Warden @petewarden.
Readers, thanks for joining us!
P.S. If you enjoyed this interview and want to learn more, then check out Data Scientists at Work - a collection of 16 interviews with some of the world's most influential and innovative data scientists, who each address all the above and more! :)