We recently caught up with Kirk Borne, trans-disciplinary Data Scientist and Professor of Astrophysics and Computational Science at George Mason University. We were keen to learn more about his background, his ground-breaking work in data mining and how it was applied at NASA, as well as his perspectives on teaching data science and how he is contributing to the education of future generations...
Hi Kirk, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...
Q - What is your 30-second bio?
A - I am a trans-disciplinary Data Scientist and Professor of Astrophysics and Computational Science at George Mason University. My professional career was primarily astrophysics for two decades, but I focused on data systems for large space astronomy projects at NASA during the years following my graduate and postgraduate work. That focus on data led me into the field of data science starting in 1998. I left my NASA position and moved to GMU in 2003 to pursue two things: data science research and the creation of the first data science undergraduate degree program in the world.
Q - What appealed to you about getting a doctorate in astronomy from Caltech?
A - Caltech was (and still is) the world’s leading university for astronomy and astrophysics graduate education. Ever since high school, my goal was to go to Caltech and use the big telescopes at Mount Palomar Observatory, whose astronomical images appeared in all of the astronomy books that I read during my youth. In order to pursue a career in astronomical research, a PhD is required, and Caltech is second to none for that.
Q - What was the transition from academia to working at NASA like?
A - The transition was mostly seamless for me since I used large telescopes at Caltech and in my subsequent postdoctoral research positions at the University of Michigan and the Carnegie Institution of Washington. Therefore, the fact that my first job was as a supporting scientist for NASA’s Hubble Space Telescope (HST) was a natural step for me. The HST Science Institute in Baltimore was growing into the absolute best astronomy research institute in the world at that time (e.g., three of its associated scientists have won a Nobel Prize in the past dozen years). I wanted to be part of that growth and that telescope. My research on colliding galaxies continued throughout and beyond that transition. There was really no transition, other than the normal one that occurs when going into your first real job.
Q - Makes sense ... So what was the first data set you remember working with? What did you do with it?
A - Well, if you want to talk “small data”, I worked with professors on two small astronomy data research projects when I was an undergraduate at LSU in the 1970s. For one of those, I analyzed data on the hottest and bluest stars in the Milky Way Galaxy, which led to the discovery of some very unusual stars (which we now call cataclysmic variable stars). For the other project, I helped create the discovery star charts for some "stars" that didn’t seem to be stars at all – many of these turned out to be quasars. That was very exciting. My first independent data project as a graduate student was to analyze the shapes and distortions of colliding galaxies, as observed through astronomical images obtained at Palomar Observatory. I was able to use those distortions to infer the masses and orbits of the colliding galaxies – I was one of the first astronomers in the world to do that.
Q - That's amazing - very impressive :) Final background question ... Was there a specific "aha" moment when you realized the power of data?
A - As an astronomer, I have used data my whole life. There was never really an “aha” moment with that. But there was a huge "aha" moment when I realized that the volumes of data that we are creating in science were reaching astronomical proportions (pun intended!). That occurred in 1998, when the astrophysics data center at NASA (where I was working) was offered a two-terabyte data set from a single experiment. That data set was larger than the cumulative volume of all of the previous 15,000 space science experiments that NASA had flown during the previous 40 years of NASA history, combined! I knew at that moment that things were drastically changing, and the power of data for new discoveries was now growing beyond our wildest dreams.
Very compelling background - thanks ever so much for sharing! Let's switch gears and talk in more detail about data mining, and your time at Raytheon onward...
Q - Tell us about what you learned from founding IST@R and becoming its Co-Director for Space Science, where you and your team carried out research into scientific data mining...
A - I was working as a Raytheon contract department manager in NASA’s Astrophysics Data Facility from 1995 through 2003. In 1998, I realized that the huge increase in data volumes was leading to huge potential for new discoveries. To achieve those discoveries, we needed the special machine learning algorithms that are used in data mining. I began devoting all of my research time to data mining research, initially on the very same colliding galaxies that I had previously studied "one at a time" but now "many at a time."
By 2001, I had developed something of a reputation as a leading data mining researcher at NASA. I didn’t realize that was happening until October 2001 (about one month after the horrible events on September 11, 2001) – in October, I was asked to brief the President of the United States on data mining initiatives at NASA. I didn’t actually do that, for various logistical reasons, but that event convinced me that we needed to step up our game in data mining. So, I worked hard to convince my Raytheon bosses that the company needed to develop expertise and a corporate capability in information science and data mining (which we now call Data Science, but we didn’t use that phrase in 2001). My efforts led to the creation of IST@R (the Institute for Science and Technology at Raytheon), and I became its first Co-Director for Space Science. I was able to obtain a few small NASA research grants to continue my data mining research within IST@R, which carried over into 2003 when I moved from NASA to GMU.
Q - Very interesting! What were the successes?
A - We secured grants to discover unusual super-starbursting galaxies in large astronomy data sets. We had a grant to build a neural network model to identify wildfires in remote sensing satellite images of the Earth. My colleagues at UMBC and I collaborated on a grant to develop data mining algorithms on distributed data – the algorithms were designed to work on the data in their distributed locations – it was one of the first successful examples of "ship the code, not the data", which now everyone is trying to accomplish.
Q - What would you do differently now given all the new technologies and techniques that have been developed since then?
A - If I could do it again, I would have focused more on the Hadoop and MapReduce technologies, which are still not part of my own skill set. But, I have enjoyed developing and testing new algorithms for discovery and inference. So, I guess I won’t give that up – there is such great pleasure in that discovery process.
Q - Makes sense :) Final question on the data mining front ... How has the work you've done in consulting, managing and developing technologies for data mining changed since you first started working in it?
A - The biggest change is that everyone now wants to do it. In those days, I could not convince most companies that I consulted with that they needed data mining technologies. I would get one or two consulting gigs per year, at most, and those businesses were almost entirely focused on data management, not on data science, or data mining, or discovery. Now, a lot of people have forgotten the importance of data management, data modeling, metadata, data organization, clever data indexing schemes, data quality, etc. So, the pendulum needs to swing back to an equilibrium place where all of the data technologies and processes are in play. Consequently, I no longer need to convince people of the importance of data science – my phone and email are now flooded with dozens of requests for assistance from companies everywhere!
That's a nice problem to have! On that note, let's switch gears and talk about another area that keeps you busy - teaching Data Mining...
Q - What compelled you to start teaching Data Mining?
A - I always loved to teach – it was a natural gift for me. When I started discovering the power of data mining and experiencing the joy of knowledge discovery from big datasets, I was like a kid in a candy store. I wanted everyone to know about it. I was giving talks on data mining in many places. I gave such a talk in the Baltimore area in 2002, and one of the database program directors from the UMUC graduate school was there – he said that they were planning to start a new data mining course in early 2003, and he asked me to teach it. I jumped at the opportunity. I couldn’t imagine anyone getting a job in the modern world where they didn’t have data skills – it became my mission in life to teach data mining to anyone and everyone who would listen to me. I frequently said (then and now) that we need to teach data mining to school kids (starting in the elementary grades), and I still believe that. Of course, the examples and methods that are taught must be tuned to the appropriate grade level, but the concepts of classification, clustering, association, and novelty discovery are all part of our human cognitive abilities from birth.
Q - That's a bold goal - how would you approach that? Approach high schools and then march down the age groups?
A - I am thinking a lot about that these days. So, the answer might be yes. The goal would be to establish professional development workshops for teachers, who might receive continuing education credits as they learn data science and create curricular materials related to it. Watch us and see what happens…
Q - Will do! Back to your current teaching for now though ... As you now teach both onsite and online, how do the two compare and contrast?
A - When I was at UMUC, I taught both online and face-to-face. I still do the same at GMU, so it is not really a change for me. However, the two learning environments are vastly different for me – I can interact more freely and tell my "war stories" from my NASA days more fluidly in the face-to-face class. Also, the one-on-one interactions in the online class are very time-consuming for me, compared to the one-on-many interactions in the face-to-face class, which I find to be much more manageable.
Q - Makes sense. So what about the experience at UMUC excited you enough to help you eventually become a full professor of Astrophysics and Computational Science?
A - I loved teaching the one graduate course in data mining at UMUC, but I really wanted to create a whole data science curriculum, including data visualization, databases, computational methods, data ethics, and more. That was one of my main motivations for going to GMU. I could never do that at UMUC, since my one course was part of a bigger database technologies program. However, I was pleasantly surprised this past year to learn that UMUC now has a graduate degree in Big Data Analytics. I am a member of that program’s advisory board – that is a very gratifying experience for me, the fact that they remembered me and asked me to join their board.
Q - What are your views on teaching Data Mining / Data Science / Machine Learning now? How have they changed since you first started teaching?
A - I think every student in every discipline needs at least one such course. I also believe that every student in a science or engineering discipline needs much more than one course. That hasn’t changed. But what has changed for me is this: I used to think that undergraduates needed to major in Data Science, and so we created a BS degree program for that, but I am now more convinced that students should take Data Science electives or take a Minor in it, to accompany their own choice of Major. That's because we need a data-savvy workforce in all disciplines in our data-rich world.
That would certainly help general workforce data-literacy! Also, a good opportunity to talk a little about the GMU Data Science BS degree program…
Q - What was the impetus for starting a Data Science BS degree program?
A - I was convinced that data would govern the world, and would change everything in business, government, social, academia, etc. So, I was driven to start a program that taught students all of the skills that make up data science, to prepare them for the data-rich world that they would be entering after college. I knew that the job opportunities would be large, but I never imagined that the number of jobs would be as huge as it has now become! So, the impetus was both my love of data science and my belief that it was absolutely essential. We started offering courses in 2007. At that same time, my GMU colleagues and I wrote a proposal to the NSF undergraduate education division to develop the program further – we called it "CUPIDS = Curriculum for an Undergraduate Program In Data Science." It was funded, and we were on our way.
Q - How did you approach it? What tools/methodologies does the program use?
A - We began as (and remain) a science degree program. So, the focus is on science problem-solving, applied math, statistics, machine learning and data mining algorithms, basic programming skills, simulation and modeling skills, and data visualization techniques. We are gradually moving toward a more general data science focus, but we are still keeping our students focused on the core problem-solving, modeling, and data mining skills of a data scientist.
Q - What has been the most surprising insight in creating the program?
A - The most surprising thing is that most students coming out of high school have never heard of data science – almost no students have received any guidance counseling about the importance of data and information in the modern world. Also, most students think of computational and data science as “information technology” – i.e., as word processing, or internet security, or system administration. They aren’t particularly interested in making a career out of that – neither am I. They don’t realize that it is all about discovery, discovery, discovery! And they don’t realize that being a data scientist is the sexiest job of the 21st century. When they do finally realize these things, they then become our best ambassadors and evangelists to their fellow students – most of our recruitment comes from "word of mouth" from students’ own peers. The other big insight for us is that we thought that students could jump from the introductory courses into the advanced courses – it is now obvious that we needed intermediate-level bridge courses, which we subsequently developed – the most successful of these courses has been our "Computing for Scientists" course, which is packed with students every semester.
Q - That's great to hear! How would you describe the goals of the program?
A - I list the goals very simply in this way: Students are trained
- to access large distributed data repositories,
- to conduct meaningful inquiries into the data,
- to mine, visualize, and analyze the data, and
- to make objective data-driven inferences, discoveries, and decisions.
That's a great goal - good luck with the ongoing journey! Finally, as you are one of the most often cited "Big Data Influencers", we would love to get your thoughts on the future of Data Science...
Q - What excites you most about recent developments and the future of Big Data / Data Science?
A - The opportunity to work with many different businesses and disciplines is truly the most exciting aspect of the work. Data scientists can work in many areas – for example, I work with people in astrophysics, aerospace engineering, transportation safety, banking, finance, retail, medical research, text analytics, climate modeling, remote sensing, and more. I now call myself a trans-disciplinary data scientist because my work in data science transcends discipline boundaries – I do not need to become an expert in those fields (i.e., not multidisciplinary) in order to work with people with a different domain expertise than mine. I see a very bright future as more and more organizations get on board the Big Data / Data Science train – there are many new technologies, algorithms, problems to solve, and things to do. It is almost overwhelming, but it is definitely exhilarating. My favorite catch phrase these days is "Never a dull moment!" That sums it all up.
Q - Couldn't agree more! Last one! ... Any words of wisdom for Big Data / Data Science students or practitioners starting out?
A - Start early and often in doing, in learning, and in absorbing data science. It takes some serious education and training, but it is worth it. Be true to yourself – know your aptitudes, your skills, your interests – don't force something that isn't there. There is a place for everyone in this data-rich world. Don't underestimate the power of curiosity, communication, and collaboration skills. They will take you further than just about anything else in life. Above all else, be enthusiastic and passionate about it. If you can see the power, discovery potential, and wonder of big data, then the passion and enthusiasm will follow. The future is very bright for those who are able to derive insights from big data, in any discipline or any job. Find the right opportunity and pursue it. There will never be a dull moment after that.
Kirk - Thank you ever so much for your time! We really enjoyed learning more about your background, your ground-breaking work in data mining and how it was applied at NASA, as well as your perspectives on teaching data science and how you are contributing to the education of future generations. Kirk can be found online at http://kirkborne.net or on Twitter at @KirkDBorne
Readers, thanks for joining us!
P.S. If you enjoyed this interview and want to learn more about
- what it takes to become a data scientist
- what skills do I need
- what type of work is currently being done in the field