Building a Data Science "Experiment Platform": Nick Elprin Interview



Data Science Weekly Interview with Nick Elprin, Founder of Domino Data Lab, on Data Science as a Service

We recently caught up with Nick Elprin, Founder of Domino Data Lab. We were keen to learn more about his background, his thoughts on Data Science as a Service and the functionality he has built at Domino…

Hi Nick, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...

Q - What is your 30 second bio?
A - Before I founded Domino Data Lab, I spent about seven years building analytical software tools at a large hedge fund. As you can imagine, economic modeling at that scale requires a lot of sophisticated data analysis, so we built some interesting technology. Before that, I studied computer science at Harvard.

Q - How did you get interested in working with data?
A - Well, I was interested in working with software, and building software to solve interesting problems. Then it turned out that there are a lot of interesting problems around data and data science that demand help from software.

Q - Was there a specific "aha" moment when you realized the power of data?
A - My first job out of college was at an algorithmic hedge fund. We would process hundreds of thousands of data series to predict market movements. That was certainly a really powerful example of how you can use data.

Q - I can imagine! … Let's talk a little about the evolving field of Data Science - how have things changed over the past 5 years?
A - It has scaled up, along every dimension: bigger data sets, more sophisticated analytical techniques, large teams working together, and a wider range of problems that now seem like good opportunities for applying data science. As companies transitioned from “we should collect lots of data” to “we should _do something_ with all our data,” the job of data scientist became much more demanding.

Q - How has the life of a Data Scientist changed as a result?
A - The variety of skills needed is really overwhelming, resulting in “unicorn-like” job descriptions for data scientists. One thing we see a lot is that data scientists have to do software engineering to build tools for themselves. This happens even within companies with strong engineering teams, because the engineers are all dedicated to working on the product (or something else) rather than providing support to data scientists.

Q - What range of tools / platforms are being developed to support this evolution?
A - There are lots of tools to address specific parts of a data science workflow. For example, there are tools that make it easier to do data cleaning; tools that make it easier to manage and explore big data sets; lots of great libraries in Python and R for specific data science techniques; lots of great tools for visualization and reporting. But nobody is really stepping back and saying, “you know, it doesn’t make sense that we’re asking our data scientists to do so many different things.” With Domino, we’re trying to cut a lot of the engineering “schlep” out of the entire analytical lifecycle, from model development all the way to deployment.

Q - Got it. So what is the key enabler of Data Science as a Service?
A - One of the things we think is really important is supporting data scientists in the way they want to work rather than trying to change it. You see products that expect users to change the language they use, or their workflow, or something like that. At Domino, we really try to minimize any impact on how the data scientist wants to work. So, for instance, we support R, Python, MATLAB, Julia, and others; and our users work in the IDEs and tools they already use, not some new editor that we’ve built.


On that note, let's switch gears and talk about Domino in more detail...

Q - What specific problem does Domino Data Lab solve? How would you describe it to someone not familiar with it?
A - I like to describe Domino as an "experiment platform": it lets data scientists improve their analyses faster by making it easy to run, track/reproduce, share, and deploy analytical models. Normally these capabilities would require a lot of engineering work and hassle to build and maintain, but Domino gives you these “power tools” out of the box.

That’s the short version. For a longer version, it’s easiest to just describe Domino’s main areas of functionality:

  1. Domino lets you move your long-running or resource-intensive compute tasks off your machine onto powerful hardware with “one click” (either in the cloud, or on a cluster behind your company’s firewall). And you can run as many compute tasks as you want in parallel. So instead of being limited by your desktop or laptop, you can run more experiments in parallel across an unlimited number of machines. It’s basically the simplest way to get access to an industrial strength compute cluster.
  2. Every time you run your code, Domino automatically keeps a snapshot of your work — including your data sets, and the results you produce — so you can reproduce past work and always have a record of how your analysis has evolved. This is critical to analytical workflows, which tend to be highly iterative and exploratory.
  3. Because Domino tracks and organizes your work centrally, it’s easy to work with collaborators. It’s like Github for data science. Domino will keep your team updated as changes happen, and let you merge your work with other people’s.
  4. Finally, Domino lets you package up your analytical models for future use. You can put a UI around your model, so non-technical users can run your analysis without interacting with your code -- or bothering you -- at all. Or you can put a RESTful API interface on top of your model, so existing software systems can interact with it. Domino provides all the plumbing to let you “deploy” your model without any setup or hassle.
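To make the deployment idea in point 4 concrete, here is a minimal sketch of the general pattern — wrapping a model behind a RESTful endpoint so existing systems can request predictions over HTTP. This is not Domino's actual API; the toy model, route, and payload shape are all illustrative, and it uses only the Python standard library.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# A stand-in "model": in practice this would be a trained estimator.
def predict(features):
    weights = [0.5, -0.2, 1.0]  # toy linear model
    return sum(w * x for w, x in zip(weights, features))

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON payload like {"features": [1.0, 2.0, 3.0]}
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve the model on a free local port, in a background thread.
server = HTTPServer(("127.0.0.1", 0), ModelHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: any existing system can now call the model over HTTP.
url = f"http://127.0.0.1:{server.server_address[1]}/"
req = Request(url,
              data=json.dumps({"features": [1.0, 2.0, 3.0]}).encode(),
              headers={"Content-Type": "application/json"})
response = json.loads(urlopen(req).read())
print(response["prediction"])  # 0.5*1.0 - 0.2*2.0 + 1.0*3.0 ≈ 3.1
server.shutdown()
```

The point of a platform like Domino is that this plumbing — the server, routing, scaling, and hosting — is provided for you, so the data scientist only supplies the `predict` function.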

Q - That's a lot of functionality! What would you say are the main benefits / attributes of your platform?
A - I think the key attribute of the platform is the idea of centralizing your analysis. Moving analysis off of analysts’ desktops onto a central server unlocks a lot of power. For example, because Domino is a central hub for analysis, it scales out the hardware behind the scenes easily; it synchronizes and shares work across multiple people; and it can automatically track changes as you work. And at a higher level what all this means is that the data scientist gets to experiment faster … and therefore learn faster.

Q - That's great! What are some of the most interesting analysis/projects you have hosted?
A - Unfortunately I can’t say much about the most interesting ones, because most of our clients are doing proprietary work (some of which we don’t even know about). One of our customers uses machine learning to do spam detection in social media, which is interesting because the “vocabulary” changes rapidly: as new people and terminology enter the social media domain, models need to be updated and re-trained rapidly. Another one of our big customers is a car manufacturer that uses Domino to process data it collects from various sensors. More specifically, a reliability engineering team uses Domino to run analysis to improve the reliability of different parts of the car. It’s very rewarding to know we’re helping — if indirectly — with something as serious as that.


Thanks for sharing all that detail - very interesting! Good luck with everything you're doing at Domino! Finally, let's talk a bit about the future and share some advice...

Q - What does the future of Data Science look like?
A - I think one interesting question is, "how much can be automated?” I’ve seen several products that seem to promise something like, “upload your data and we’ll find the insights for you.” My personal view is that this is misguided; that there is a critical “human” element of this work that won’t be automated for a long time (until we develop a real artificial intelligence, I suppose). “Can we automate data science” is a bit like asking, “could we automate science.” That’s because, like any scientific or truth-seeking activity, the questions you ask, and the places you look, matter as much if not more than the techniques you use. And I don’t see software being able to ask insightful questions anytime soon.

So… my bet is that we will make tools better and better at augmenting — rather than replacing — human intelligence, understanding, inquisitiveness, and domain expertise. I think the key metric for measuring the progress of our data science tools is: what percentage of time are analysts spending on their core problem, rather than on distractions (e.g., connecting to data sources, waiting for code to run, configuring infrastructure). I don’t think that number will ever get to 100%, but it should get much higher than it is today.

Q - Any words of wisdom for Data Science students or practitioners starting out?
A - Get your hands dirty as much as you can. Trying things is the best way to learn.


Nick - Thank you ever so much for your time! Really enjoyed learning more about your background, your thoughts on Data Science as a Service and the functionality you have built at Domino. Good luck with all your ongoing projects!

Readers, thanks for joining us! If you want to know more, Domino can be found online here and on Twitter @DominoDataLab.


P.S. If you enjoyed this interview and want to learn more about
  • what it takes to become a data scientist
  • what skills you need
  • what type of work is currently being done in the field
then check out Data Scientists at Work - a collection of 16 interviews with some of the world's most influential and innovative data scientists, who each address all of the above and more! :)
