You want to create a data science portfolio to show that you can “do” data science: that you know how to take in a data set, clean it up, use various techniques to extract useful information from it, and then communicate the results. The problem is that you aren’t sure where to start, what projects to do, what languages to use, or even what techniques to use. You’ve read on several blogs about people using just one or two algorithms to make awesome things, so you’re thinking of picking small, interesting projects, but you’re afraid it’ll just be a waste of time.
You want the right project that will teach you the right things and showcase the right skills
You’ve found that learning something by doing is better than going through a pile of books. So you want to make sure the project you choose will help you not only learn how to “do” data science, but also get the data science job that you want.
Doing a data science project is great interview preparation
By picking the right project, you’ll learn some math, statistics, machine learning, programming techniques, programming languages, and perhaps even get an inside track on getting the job. This project will be the perfect thing to discuss during the data science job interview. It’ll even be a great “calling card” that you can send ahead of time as sample work when you apply for the data science job. It should also help you prepare for the actual interview, as you’ll be familiar with some of the work the company is doing. Overall, doing the right data science project will be immensely helpful to your interview preparation.
“Show”, don’t “Tell” with your data science project portfolio
As we covered in the article “How You Should Create A Data Science Portfolio That Will Get You Hired”, a key part of the portfolio is being able to showcase your critical thinking and communication skills. To that end, you want to make sure that the project you pick conveys that you are familiar with the company, that you are familiar with what they are looking for, and that you’ve accomplished some of the things they want you to be familiar with.
Three Steps & One Real Example
The three steps to choosing a data science project for your data science portfolio are as follows (with further explanation below):
- Find a data science job you would take
- Take notes on skills, qualifications, and what you’ll be doing
- Create projects based on that data
The one real example that we’ll look at is a job posting from today, January 6th, 2015, for a Data Science position at Capital One. The link will be below, as well as the analysis and three potential projects to do for this particular posting.
That said, let’s go back and give a bit more description of the three steps above.
Step 1 - Find a data science job you would take
The first step is to find a data science job you would take. The reason to do this is that if you are going to put in the hard work to do a project, you should make sure that it’s actually something you are excited about doing. Websites to look at are LinkedIn, Indeed, Stack Overflow, CareerBuilder, and Glassdoor. These are all “job websites” that have hundreds of listings for “Data Science” roles.
Step 2 - Take notes on skills, qualifications, and what you’ll be doing
The second step is to figure out exactly what this company wants. This breaks down into three things - skills, qualifications, and what you’ll be doing. The skill portion should include programming languages and previous experience. The qualifications portion may also include programming languages, but will also mention experience as well as academic background. Finally, the third portion (and the most important) is what you’ll actually be doing. This is where you’ll get the most ideas of what project to do for your data science portfolio.
Step 3 - Create projects based on that data
Now that you’ve done the previous two steps, you’ll have data regarding what programming languages to use, what data to use, what the employer wants you to have done before, and what potential directions you should go. From this data, you can put your “thinking cap” on and come up with a few different things to explore. Remember - the goal is to “do” data science, not to write a thesis or do something perfect. Do something, write it up, reflect on what you learned and then do it again! :)
Alright, that’s all well and good. Let’s now take a look at a real example.
Example - Choosing data science projects from a Capital One Data Science Job Posting
As described in the article “How You Should Create A Data Science Portfolio That Will Get You Hired”, we’ll look at one job posting to come up with three different potential projects. The job posting we will look at is from Capital One, where they are looking for a "Data Scientist" who will be located in "New York, NY". The job posting can be found here => Capital One Data Scientist Job Posting [Note: we use web.archive.org to make sure this example doesn’t disappear].
Example - skills extracted from job posting
Let's take a look at the key section that describes skills. Here we see the following:
- Wrangler. You know how to move data around, from a database or an API, through a transformation or two, a model and into human-readable form (ROC curve, Excel chart, map, d3 visualization, Tableau, etc.). You probably know Python, Java, R, Storm, Julia, SQL, Matlab, Mahout, or think everything can be done in a Perl one-liner.
From there we can see a few key ideas: moving data around, using databases, using APIs, performing data transformation, and creating visualizations. This also gives us a few programming languages they expect: Python, Java, R, Julia, Matlab, Perl, and SQL, as well as some systems: Storm and Mahout.
Example - qualifications extracted from job posting
Let's take a look at the key section that describes qualifications. Here we see the following:
- 3 years’ experience in R, Perl, Python, Java, or other languages appropriate for large scale analysis of numerical and textual data
From this we learn a bit more about their preferred programming languages: R, Perl, Python, and Java. We also see what types of data they are using: numerical and textual data.
Example - what you’ll be doing extracted from job posting
Let's take a look at the key section that describes what you'll be doing. Here we see the following:
- Using Hadoop and related tools (Pig, Hive, Impala) to manage the analysis of billions of customer transaction records
- Writing software to clean and investigate large, messy data sets
- Integrating with external data sources and APIs to discover interesting trends (NOAA Weather Data + Credit Card Transactions = Fascinating!)
- Creating machine learning models from development through testing and validation to our 30+ million customers in production
- Designing rich data visualizations to communicate complex ideas to customers or company leaders
- Investigating the impact of new technologies on the future of mobile banking and the financial world of tomorrow
From these we see a bit more about the systems they are looking for: Hadoop, Pig, Hive, and Impala. We also see the relative size of the data: billions of customer transactions and 30+ million customers. Part of the job will require cleaning and investigating large, messy data sets, and there will be integration work with external data sources and APIs. We see some of the data sets they are using: NOAA Weather Data, credit card transactions, mobile banking, and general financial transactions. Finally, the job will involve doing data visualization for non-data scientists, so top-notch communication skills are required.
Example - putting skills, qualifications, and what you'll be doing together
Putting it all together, we now have a list of programming languages you can create your portfolio project in: Perl, R, Python, and Java. We have some data that you can use: external data sources, APIs, NOAA Weather Data, and financial transactions. We have some systems you can use: Hadoop, Pig, Hive, and Impala. We also have a goal: make something that customers and/or company leaders (read: non-data scientists) can understand.
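The note-taking described above can be partly automated once you have several postings saved. As a minimal sketch, assuming you have pasted the posting text into a string, the Python below checks which tools a posting mentions. The keyword lists and the function name `find_keywords` are illustrative choices of ours, not part of the posting.

```python
import re

# Hypothetical keyword lists; extend them for the postings you collect.
KEYWORDS = {
    "languages": ["Python", "Java", "R", "Julia", "SQL", "Matlab", "Perl"],
    "systems": ["Hadoop", "Pig", "Hive", "Impala", "Storm", "Mahout"],
}

def find_keywords(posting_text, keywords=KEYWORDS):
    """Return, per category, the keywords that appear as whole words."""
    found = {}
    for category, words in keywords.items():
        found[category] = [
            w for w in words
            # Lookarounds stop "Java" matching "JavaScript" or "R" matching "Ruby".
            if re.search(r"(?<!\w)" + re.escape(w) + r"(?!\w)", posting_text)
        ]
    return found

# Text taken from the two posting excerpts quoted in this article.
posting = (
    "You probably know Python, Java, R, Storm, Julia, SQL, Matlab, Mahout, "
    "or think everything can be done in a Perl one-liner. "
    "Using Hadoop and related tools (Pig, Hive, Impala) to manage the analysis "
    "of billions of customer transaction records."
)
notes = find_keywords(posting)
print(notes)
```

The match is case-sensitive on purpose, so a single-letter name like R is not confused with ordinary lowercase text. Running the same function over several postings gives you a quick sense of which skills come up most often.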
Example - Three example projects to do from this one job posting
From the above, we can put together some preliminary thoughts on some sample projects to put together for this one Capital One Data Science Job Posting.
- Data Wrangling :: Get NOAA Weather Data + Twitter Data for Twitter Hashtag #bought
- Visualization :: Visualize the above data and showcase interesting aspects of it
- Modeling :: Few ideas - does weather affect use of #bought hashtag, does weather affect what other hashtags are used, given weather today can you predict number of #bought hashtag uses tomorrow?
Example - project :: Data Wrangling
The job posting is interested in your ability to use external data sources and APIs. The job posting also mentioned NOAA Weather Data and financial transactions. Based on this, you should extract data, clean it up, and integrate two or more disparate data sources. Bonus points for combining textual and numerical data. Given that they've already mentioned one data source, NOAA Weather Data, you should use that one. For the second one, you should do something related to financial data. Since this is normally private information, you have to think of a proxy. One good proxy is that people who tweet about buying something often use the hashtag #bought. This means you can get text data for this hashtag from the Twitter API, so the weather data will be numerical and the Twitter data will be textual.
You can then write up what you found, how you went about finding the data, how you used the external data sources, how you combined them, and what you learned.
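To make the "combine two sources" step concrete, here is a minimal sketch that joins numerical weather rows with textual tweets by calendar day. It assumes the data has already been downloaded; the sample rows, the field names (`max_temp_c`, `created_at`, `text`), and both function names are made up for illustration and do not reflect the actual NOAA or Twitter response formats.

```python
from collections import Counter
from datetime import datetime

# Made-up sample rows standing in for data already downloaded from each source.
weather_rows = [
    {"date": "2015-01-04", "max_temp_c": 3.5},
    {"date": "2015-01-05", "max_temp_c": -1.0},
]
tweets = [
    {"created_at": "2015-01-04T09:15:00", "text": "New boots! #bought"},
    {"created_at": "2015-01-04T18:40:00", "text": "#Bought a snow shovel"},
    {"created_at": "2015-01-05T12:05:00", "text": "Finally #bought that coat"},
    {"created_at": "2015-01-05T13:00:00", "text": "Just browsing today"},
]

def daily_hashtag_counts(tweets, hashtag="#bought"):
    """Count tweets per calendar day whose text contains the hashtag.

    This is a simple case-insensitive substring check, so it would also
    count longer tags that merely start with the hashtag.
    """
    counts = Counter()
    for tweet in tweets:
        if hashtag.lower() in tweet["text"].lower():
            day = datetime.fromisoformat(tweet["created_at"]).date().isoformat()
            counts[day] += 1
    return counts

def join_by_date(weather_rows, counts):
    """Attach the day's hashtag count to each weather row (0 if none)."""
    return [
        {**row, "bought_tweets": counts.get(row["date"], 0)}
        for row in weather_rows
    ]

combined = join_by_date(weather_rows, daily_hashtag_counts(tweets))
for row in combined:
    print(row)
```

In a real project the cleaning step is usually the bulk of the work: time zones, missing days, and retweets all need a decision, and those decisions are worth describing in your write-up.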
Example - project :: Visualization
The job posting is interested in your ability to visualize data and information for non-data scientists. Now that you have data from two different sources with two different types of data, you can come up with visualizations that show the interesting things you found and/or let others explore what you found. Remember that this is a financial company that operates in certain parts of the world: you can make map visualizations, or you can make an exploration tool that explores times, dates, the past, or even certain time periods, like the financial crisis of 2008.
Example - project :: Modeling
The job is interested in your ability to do statistical modeling, discover interesting trends, and apply machine learning and algorithms. This is where you can shine while learning your math, statistics, and machine learning techniques. Some interesting questions you can try to model: does weather affect use of the #bought hashtag, does weather affect what other hashtags are used, given the weather today can you predict the number of #bought hashtag uses tomorrow, or anything else where you use weather to gain insight into how #bought is used? Bonus points here for pointing out the demographics and psychographics of Twitter's population, as well as figuring out how to get location from the Twitter data so that you can more accurately combine weather and hashtag usage.
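One simple way to start on the "given weather today, predict #bought uses tomorrow" question is a one-variable least-squares line with a one-day lag. The sketch below is only a starting point under that assumption: the numbers are made up, `fit_line` is our own helper, and a real analysis would need more data, a held-out test set, and a check that the relationship is linear at all.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up daily values: max temperature (C) and #bought tweet counts.
temps = [2.0, 5.0, 8.0, 11.0, 14.0]
counts = [120, 135, 150, 170, 180]

# Pair today's temperature with tomorrow's count (a one-day lag).
today_temps = temps[:-1]
tomorrow_counts = counts[1:]

slope, intercept = fit_line(today_temps, tomorrow_counts)

# Use the most recent temperature to predict the next day's count.
predicted = slope * temps[-1] + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.1f}")
```

Even a model this small gives you something to discuss in an interview: why you chose a lag, what the slope means in plain language, and what you would try next.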
The end is the beginning
And there you go. You now have three projects that you can build for this specific job posting. Again, you found the job posting, you looked at what they wanted, and you came up with three potential projects to build that matched up directly with what they mentioned. As with most things, the end is the beginning. Now that you've done this, you'll want to do this with another job posting and some other projects. The more that you do, the more proficient you'll become, and the more prepared you'll be for the data science job interview, as well as for working in the data science field in general.
To that end, it's time to get started with the first step! Go out today and start looking at data science job postings to see if there are any that you like. If you find one or a few, copy the text down and start keeping notes so that you can start deciding which data science projects you'll create for your portfolio!