We recently caught up with Szilard Pafka, Chief Scientist at Epoch and Founder of Data Science LA. We were keen to learn more about his background, his role building the LA Data Science community and his work at Epoch…
Hi Szilard, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...
Q - What is your 30 second bio?
A - My primary field of study at university was Physics (BSc/MSc/PhD) with adventures in Computer Science (BSc) and Finance (MA). Like many people in Physics at that time (late 90s) I was working with data, models and computational approaches, and I ended up working in risk management in a bank while still working on my PhD research involving statistical modeling of financial prices. In 2006 I came to California to be the Chief Scientist for Epoch, essentially doing data science (data analysis, modeling/machine learning, data visualization etc.) way before the term “data science” was used to describe this. In 2009 I started organizing an R meetup in Los Angeles (which in retrospect was the very first data science meetup in LA) with the goal of bringing together data professionals to learn from each other. More recently I started other meetup groups focused on my other professional interests (DataVis, Data Science), and finally, a few weeks ago, with the involvement of a couple of other volunteers, we started datascience.la, a website serving the growing LA data science community.
Q - How did you get interested in working with data?
A - I was always interested in math, physics and later in computers (my first was a C64 in the late 80s). Later on I got involved in data, modeling and computing. My Monte Carlo simulations in the field of materials science (dislocation systems) generated lots of data that needed to be analyzed, and I think that's how I started to use tools for data munging/analysis/visualization more seriously.
Q - So, what was the first data set you remember working with? What did you do with it?
A - There were a couple of datasets, like my running times, that I graphed with paper and pencil (late 80s). Later on I played with inputting the data on the C64 and graphing it that way. Without reading any formal literature at that time, I became fascinated by the power of visualization (e.g. to see trends or detect outliers).
Q - Was there a specific "aha" moment when you realized the power of data?
A - I cannot pinpoint a specific time, but there is a trend: the more sources we have for data (collected, generated etc.), the more useful it can become. But there are also dangers, especially to privacy and security, as is becoming more and more clear.
Q - Makes sense … On that note, what excites you most about recent developments in Data Science?
A - First, with all the hype let's recognize that data science is many decades old (we could go back even to John Tukey). Many of the machine learning algos were developed in the 90s or 2000s. The basic software tools used by most data scientists are over 10 years old. On the other hand, on top of this solid foundation, there is an extraordinary pace of new developments. Many of the new add-on tools have hugely increased my productivity (e.g. RStudio, knitr, shiny, and more recently dplyr). Many others make it possible to do things that were not possible before (increasing computing capacity also helps). We now also have open source tools to tackle larger and larger datasets (Hadoop, but more excitingly for data scientists, tools that support interactive analysis such as Impala or Spark).
A - I think it's all over the place. We are getting/collecting data from more and more sources, more and more industries, from sensors, from humans, from crowdsourcing and the list goes on. Next this data is processed, analyzed, used to improve processes. It's hard to imagine any industry that will not benefit.
Very true! Thanks for sharing all that background. Let's switch gears and talk about your role promoting the data science community in LA...
Q - How did you come to found / organize Data Science LA?
A - DataScience.LA has its roots in the LA R meetup that I started with Prof. Jan de Leeuw in 2009. While it was an R meetup, my goal from the beginning was to put everything in a more general context of data analysis / modeling. With the rise of “data science” as a term for our essentially old craft, we started to have events on more general topics and ultimately I started new meetup groups to focus on specific parts of data science (DataVis) or the overall process of combining tools in various companies. DataScience.LA takes this to a new level, by preserving the content of the meetups (slides, code, video recordings etc.) and involving the community in new ways (such as blogging). It is also a way to scale up the community leadership by involving top-notch data scientists from LA in serving the needs of the growing data science community.
Q - That's great! So what are your primary goals?
A - I touched a bit on that in the answer to the previous question, but in one phrase it would be building a world-class data science community in LA.
Q - On that note - What has been the most memorable meet-up presentation(s) you've seen?
A - We had many, many excellent talks from professionals in LA and outside LA. For example, at the R meetup we had Hadley Wickham, Dirk Eddelbuettel, Michael Driscoll and Ryan Rosario (to name just a few of the better known names from outside LA). At the DataVis meetup we had a fascinating talk by the LA Times' Data Desk, while at the Data Science / Machine Learning meetup we had talks by, for example, Netflix, Activision (Call of Duty) and Factual.
Q - Wow - that's a terrific bunch of speakers! … What advice would you give to others looking to organize a Data Science group/meet-up in their own city?
A - I would encourage everyone; it's such a rewarding endeavor. It ultimately boils down to getting speakers, a venue (and a sponsor for food) and, most importantly, members. Use meetup.com; it nicely takes care of all the administration for you. For a venue, talk to companies willing to host (and provide food); it is much easier now than, say, 5 years ago. Don't get discouraged by low attendance: we had about 30 people attending the R meetup in the first 2 years (well, the number of R users in general exploded only after that).
Got it - good advice! Now, we'd also love to talk a bit about your role at Epoch...
Q - How are you using Data Science at Epoch? What types of questions does it help you solve?
A - Epoch is an online credit card transaction processor, so obviously the main problem is fraud detection, but there are many other areas, for example sales tracking, marketing or consumer satisfaction, that can be improved by models or insights from data. Epoch was wise enough to hire a data scientist way before “data science” got hot, and we have developed several sophisticated tools starting many years ago.
Q - What has been the most surprising insight you've found?
A - Unfortunately I'm not allowed to share details about Epoch, but my general philosophy is to start with a business problem a company needs to solve (usually improving the bottom line), understand the domain, look at the data and come up with solutions that are best suited for the problem. The outcome can be advice for an action or a model that can be deployed. Sometimes simple things such as a real-time dashboard can provide a lot of value (monetarily); in other cases you might need a fancy machine learning algorithm.
Q - Makes sense! How does your team interact with the rest of the organization?
A - In any data-driven organization, data science should have a central role. It has to interact with (advise) top management and it has to connect to all parts of the organization where data can drive decisions or optimize processes. This is fairly easy to do in a small organization, but in a larger one it has its challenges. Ideally, data scientists learn the domain knowledge in the various parts of the organization, explore the data, give strategic advice and develop models that can operationalize micro-decisions, but they also disseminate a data-centric view across the organization and mentor key personnel in other departments so that they can increasingly use data and the results of data analysis and models in their day-to-day jobs.
Thanks for sharing all that detail - very interesting! Good luck with all your endeavors - both at Epoch and in the broader LA Data Science community! Finally, let's talk a bit about the future and share some advice...
Q - What does the future of Data Science look like?
A - Let me quote Niels Bohr: “Prediction is very difficult, especially if it's about the future.” Data science is good at predicting micro-events where we have data about lots of past micro-events: we fit a distribution (implicitly most of the time, e.g. in some non-parametric model or by applying some learning algorithm) and we assume our world is stationary. Predicting macro-events in society or technology is a completely different thing.
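To make the micro-event idea in the answer above concrete, here is a minimal Python sketch. It is only an illustration, not anything used at Epoch: the event history, the category labels and the `fit_rates` helper are all hypothetical. It estimates an outcome frequency per category from past events and would reuse those frequencies as predictions, which only makes sense if the future resembles the past (the stationarity assumption).

```python
from collections import defaultdict

def fit_rates(events):
    """Estimate P(outcome = 1 | category) from past (category, outcome) pairs."""
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for category, outcome in events:
        counts[category][0] += outcome
        counts[category][1] += 1
    return {c: pos / total for c, (pos, total) in counts.items()}

# Hypothetical history of micro-events: (category, outcome), outcome is 0 or 1.
past_events = [
    ("A", 0), ("A", 0), ("A", 1), ("A", 0),
    ("B", 1), ("B", 1), ("B", 0), ("B", 1),
]

# The fitted rates are the "distribution"; using them for new events
# assumes the underlying process has not changed (stationarity).
rates = fit_rates(past_events)
print(rates)
```

If the process drifts over time, rates fitted on old events stop being good predictions, which is one reason macro-level forecasting is so much harder.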
Q - Any words of wisdom for Data Science students or practitioners starting out?
A - Sure. Get a balance of theory and hands-on experience. For theory there are numerous books, free classes etc. For hands-on experience, spend time “looking” at data. This involves mostly tedious data munging, but besides preparing and cleaning the data, you gain understanding of the data and the domain. If you do modeling, make sure you understand how it works and what the assumptions, limits and pitfalls are, and spend enough time evaluating your models.
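As one small illustration of the advice to spend time evaluating models, the Python sketch below holds out part of a dataset and compares a fitted line against a naive baseline on the held-out part. Everything in it is an assumption made for the example: the synthetic data, the 70/30 split and the helper names `fit_line` and `mse` are hypothetical, and real evaluation usually involves more than a single split.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(ys, preds):
    """Mean squared error between observed and predicted values."""
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Synthetic data (an assumption for the example): y = 2x + 1 plus noise.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1)) for x in range(100)]
rng.shuffle(data)
train, test = data[:70], data[70:]  # hold out 30% for evaluation

slope, intercept = fit_line(*zip(*train))
baseline = mean(y for _, y in train)  # naive model: always predict the mean

test_x, test_y = zip(*test)
model_mse = mse(test_y, [slope * x + intercept for x in test_x])
baseline_mse = mse(test_y, [baseline] * len(test_y))
print(f"model MSE: {model_mse:.2f}, baseline MSE: {baseline_mse:.2f}")
```

Comparing against a trivial baseline on data the model never saw is a quick check that the model has learned something beyond the average.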
Szilard - Thank you ever so much for your time! Really enjoyed learning more about your background, your role building the LA Data Science community and your work at Epoch. Good luck with all your ongoing projects!
P.S. If you enjoyed this interview and want to learn more about
- what it takes to become a data scientist
- what skills do I need
- what type of work is currently being done in the field