We recently caught up with Harlan Harris, Co-Founder and current President of Data Community DC (DC2). After multiple years (and degrees!) in academia, he transitioned to industry as a Data Scientist in 2009. We were keen to learn more about his background, the vision for DC2 and his views on how Data Science is evolving ...
Hi Harlan, firstly thank you for the interview. Let's start with your background...
Q - What is your 30 second bio?
A - I'm from Madison, WI. Undergrad from UW-Madison in Computer Science with most of a second major in Linguistics. Grad school at Illinois - Urbana/Champaign, where I wrote a dissertation in Computer Science (Machine Learning), while doing a lot of Psycholinguistic and Cognitive Science coursework and research on the side. Cognitive Psychology post-docs at Columbia University/UConn and at NYU. I was good at the pieces, but not at the whole, and couldn't make an academic career work. Switched to industry as a data scientist in 2009, first at Kaplan Test Prep and now at Sentrana, Inc in Washington, DC. I was pretty involved in the professional data Meetup scene in NYC before I moved, and even more so here in DC. ... Also: married, foodie, lapsed fencer.
Q - How did you get interested in working with data?
A - As an undergrad I took several AI classes, including a Machine Learning class taught by Jude Shavlik. That was my favorite undergrad class in my major, and led me to continue with ML as a graduate student. I then got further hooked by the statistics side of things when I started doing psychological research.
Q - What are your favorite tools / applications to work with at home and/or at the office?
Q - What was the first data set you remember working with? What did you do with it?
A - Ever? I think there was a 4th grade science project that involved talking to plants. I'm pretty sure I proved (with n=3) that houseplants do better when you yell at them.
That's a pretty powerful learning at such a young age :) Thanks for sharing your background. Let's talk more about Data Science and how the landscape is evolving...
Q - What excites you most about recent developments in Data Science?
A - I'm fascinated by the idea that we're watching a new kind of professionalization of a discipline. Existing academic and professional boundaries are being redrawn. But unlike the creation of new disciplines in the last century, such as the formation of Computer Science out of Electrical Engineering and Mathematics, we now have a bottom-up, peer-driven community, supported by on-line tools such as Meetup and StackOverflow. Being part of a professional society seems less important now than ever before. But having visibility and credibility - and of course skills - is as important as ever. (See my recent Ignite talk about this...).
On a technical level, it's interesting that Neural Nets are back in fashion in the form of Deep Learning. I'm also interested to see what happens with probabilistic programming and the maturation of Monte Carlo modeling techniques.
Q - What industries do you think will benefit most?
A - Basically anything with repeated processes, lots of data exhaust, and a well-defined success criterion. The relative cheapness of data science techniques these days means that stuff that used to be limited to just governments and enormous businesses can be applied by small teams to things like healthcare analytics and journalism, which is drastically changing those fields. On the other hand, there are a lot of really interesting domains where there's no relevant data, or where you can't usefully define success, or where every situation is basically unique. For example, you can't use predictive analytics to tell you how to write a healthcare law.
Q - What are the biggest areas of opportunity / questions you would like to tackle?
A - Drew Conway famously put Domain Knowledge as a key part of the Data Science Venn Diagram. I'm interested to see whether simple AI systems that have simple domain knowledge capabilities can supplement the statistical tools in a useful way in a broader set of applications. Right now, the domain knowledge is in our heads - is it possible to extract just enough domain knowledge into software so that more people can more efficiently focus on the questions rather than the tools? IBM's Watson is one approach to this, but I think there will be a lot more systems that try different approaches in coming years.
Very thought-provoking - that would definitely transform a lot of professions! Let's change gears and talk more about your involvement in the Data Science Community…
Q - How did you come to found / organize Data Science DC?
A - I was an occasional presenter at the R and other Meetups in NYC before I moved in 2011. When I came to DC, there was an R Meetup, run by Marck Vaisman, but nothing else. Along with a data scientist at WaPo Labs, Matt Bryan, we formed Data Science DC that summer. It was a bit ahead of the times to call the Meetup "Data Science" - everything else was Predictive Analytics or Machine Learning or something. The Meetup's been very successful, and in 2012, we decided we wanted the capability to do bigger and better things, so, along with several others, we created an umbrella organization called Data Community DC, or DC2. DC2 now has six Meetup groups with over 5000 unique members, a board of 12 people, a blog, occasional workshops, and plans for bigger events in the future. I'm the current President of DC2.
Q - What are the primary goals of the organization?
A - Here's DC2's current mission statement: Data Community DC is an organization committed to connecting and promoting the work of data professionals in the National Capital Region by fostering education, opportunity, and professional development through high-quality, community-driven events, resources, products and services.
Within DC2, Data Science DC, which I'm still the primary organizer of, focuses on the "algorithmic" or problem-solving level. Basically, we want to give people an opportunity to share what they're working on and what approaches they're excited about and to meet other people in their professional community, even those who work on wildly different problems and domains.
Q - What have been 3 of the most memorable Meetup presentations?
A - Wow, DSDC alone has had 30 events... Let's see, of those... I really liked the Recommendation Systems event, where two great presenters, from WaPo Labs and LivingSocial, talked about real-life applications of the technology. We had a presentation by a team at the Sunlight Foundation that involved everything from problem formulation to data collection to graph analysis to data visualization. Another great one was a panel discussion about Data Science in political campaigns - entertaining and fascinating. In all three cases, our presenters had real problems, in retail, or journalism, or marketing, and used a very wide variety of tools and techniques to do things that would have been flat impossible, or taken orders of magnitude more resources, just a decade ago. It's really inspiring... The other DC2 Meetup groups have all had amazing events too!
Q - What has been the most surprising insight / learning from organizing the group?
A - Hmm. One thing is that almost everybody who gathers up the courage to give a presentation to scores or hundreds of their peers knocks it out of the park. It gives me amazing faith in humanity that everyone seems to be so good at their jobs!
Q - What advice would you give to others looking to organize a Data Science group/Meetup in their own city?
A - Get sponsorship, and minimize support from your employer. Astroturf Meetups don't last. But there are many, many great companies that would love to chip in some money for potential customers and employees to get pizza and soda before presentations. Don't be afraid to ask individual people who you think do interesting work to speak - most will, and do a great job. Steal ideas from Meetups in NYC. :)
Makes sense :) Harlan, what you have managed to build for the Data Science community in DC is really impressive - look forward to hearing more about the various groups going forward! Finally, it is advice time...
Q - What does the future of Data Science look like?
A - There will be people coming out of academic programs with Masters degrees in Data Science very soon. It'll be very interesting to see how those people interact with people who pivoted professionally. There'll be more certification and more coherence in terms of what people know and are expected to be able to do.
I suspect techniques for Big Data analysis will continue to be important, but perhaps relatively less so over time as those tools mature. Medium Data, where you have to think about the scale of the problem to solve it, but where you can move the data around without too much problem, will be where most of the action is. ... I'm also personally interested in the impact of Open Data and Civic Analytics on people's lives around the world.
Q - Any words of wisdom for Data Science students or practitioners starting out?
A - Get involved in your professional community, whether it's attending Meetups (and meeting people at the bar afterwards), or answering questions on StackOverflow or CrossValidated, or trying your hand at a Kaggle competition or a hackathon. Learn about the many different points of view of people doing work related to your interests.
Harlan - Thank you so much for your time! Really enjoyed learning more about the evolving Data Science landscape and what you are building at Data Community DC. DC2 can be found online at http://datacommunitydc.org/blog and Harlan Harris online at http://www.harlan.harris.name or on Twitter @HarlanH.
Readers, thanks for joining us!
P.S. If you enjoyed this interview and want to learn more about
- what it takes to become a data scientist
- what skills you need
- what type of work is currently being done in the field