Building A Data Science Portfolio Project Top Down


Employers want you to add value from day 1. They don't have time to wait 90, 180, or even 360 days for you to start contributing. To that end, a data science portfolio that showcases your skills in the areas your future employer cares about is a great way to get the data science job of your dreams. You've read the article How To Choose A Data Science Project For Your Data Science Portfolio and now you want to get started. In a previous article, Building A Data Science Portfolio Project Bottom Up, you learned how to come up with a project by starting at the bottom, with the data. This article covers the basics of the top down approach: starting with questions.

Bottom Up versus Top Down

There are two ways to think about doing a data science portfolio project - bottom up and top down. Bottom up means that you start by thinking about the actual data set that is available. Top down means that you start by thinking about questions that you want answered. Both are equally valid and useful. Sometimes you'll find that a problem is more tractable with the top down approach, and other times you'll find that it is more tractable with the bottom up approach. Regardless of which approach you start with, it's worth eventually thinking about the problem from both directions.

The Question(s)

In this approach, you start by thinking about the problem and what questions, if answered, would help you solve it. You might even call it questions-then-resources. You have questions you want answered, and you will have to go out and find the right sources of data to help you answer them. You may find that the data you need does not exist, or is in another part of the world, or is in another format, or is currently part of another data set and you will have to extract it. Once you have the problem and the questions, you will have what you need to set up the project.

The Problem

First, ask yourself: what am I, or what is this company, trying to do? Unless you work for a financial company, the answer will not be "make more money". It will be things like: sell more dresses, help people find more movies they love, show people hotels closer to where they'll be staying, give people more articles they will enjoy reading, help decrease employee turnover, find new candidates, predict which candidates will be successful in this role, etc. That is the problem - what you or the company is really trying to do. Yes, if you or the company does this well, more money will be made, but for now, forget about the money.

The Questions

Now that we have the problem, we want to ask some questions to dig deeper into what potential projects could come out of it. Netflix is going to be used as the example company. Netflix has the problem - help people find more movies they love.

The questions to ask are ones that you've probably encountered before - the five Ws and one H:

  • Who
  • What
  • Why
  • When
  • Where
  • How

So you would ask:

  1. Who needs help finding more movies they love?
  2. What movies do they love?
  3. Why do they need help finding more movies they love?
  4. When do they need help finding more movies they love?
  5. Where do they need help finding more movies they love?
  6. How do they need help finding more movies they love?
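If you plan to repeat this exercise for several companies, the six questions above can be generated from a problem statement. The following is only a minimal sketch: the function name six_questions and its two parameters are made up for illustration, and the wording simply mirrors the Netflix list above.

```python
def six_questions(goal, thing):
    """Return the six starter questions for one problem statement.

    goal  -- the problem phrased as an "-ing" clause (hypothetical parameter)
    thing -- the noun the "what" question is about (hypothetical parameter)
    """
    return [
        f"Who needs help {goal}?",
        f"What {thing} do they love?",
        f"Why do they need help {goal}?",
        f"When do they need help {goal}?",
        f"Where do they need help {goal}?",
        f"How do they need help {goal}?",
    ]


# The Netflix problem from this article: help people find more movies they love.
questions = six_questions("finding more movies they love", "movies")
for number, question in enumerate(questions, start=1):
    print(f"{number}. {question}")
```

Swapping in a different goal and noun gives you a starting list for another company. The "what" question in particular will usually need rewording by hand.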

Now of course, you are looking for a data science job and, for this example, are hoping to work at Netflix. That means you will not have access to Netflix's own data, so you get to answer these questions as you wish! This is scary and liberating at the same time.

Problem -> Questions -> Data Sets

Now that you have the problem and some basic questions, you get to start thinking about the data sets that you can use to answer the questions and, eventually, address the problem. To make it easy on yourself, a basic data set to use is Twitter. As you start to answer the questions, some potential data science projects will pop up.

You can hopefully see how a good number of data science projects drop out of these questions. For instance, you could explore the "what" question with Twitter data. Given a set of users, can you predict what movies they will love? Given a set of users, can you predict the hashtags of the movies they watch? Given a set of hashtags and user locations, can you classify users into various types of audiences? Given a set of hashtags, users, and user locations, can you figure out what drives opening weekend box office receipts? And this is just from the "what" question. You can explore all the other questions in the same way.
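As a first pass at the hashtag and location ideas above, you might simply count which movie hashtags show up in each location. The sketch below is only an illustration under loud assumptions: sample_tweets is a made-up toy sample, not real Twitter data, and the field names user, location, and text are hypothetical rather than any real API's schema.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy records standing in for collected tweets; not real data.
sample_tweets = [
    {"user": "user_a", "location": "Austin", "text": "Loved it! #MovieA #weekend"},
    {"user": "user_b", "location": "Austin", "text": "#MovieA again tonight"},
    {"user": "user_c", "location": "Boston", "text": "Finally saw #MovieB"},
    {"user": "user_a", "location": "Austin", "text": "Rewatching #movieb"},
]

# A hashtag is a "#" followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def hashtags_by_location(tweets):
    """Count lowercased hashtags separately for each user location."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        for tag in HASHTAG_PATTERN.findall(tweet["text"]):
            counts[tweet["location"]][tag.lower()] += 1
    return counts


counts = hashtags_by_location(sample_tweets)
for location, tag_counts in sorted(counts.items()):
    print(location, tag_counts.most_common(2))
```

Counts like these are not a model, but they are a quick way to check whether a data set can speak to your question at all before you commit to a project.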

Why this matters for a data science job

In a data science job, you will often find clients come forward with data sets and ask you to figure things out - which is the bottom up approach. You will also find many times that your clients / bosses / colleagues come to you with a problem and ask you to look at it. In this case, you will take the top down approach and start asking the right questions.

This is great, because as you get into this mind-set while putting together your data science portfolio, you'll have to grapple with the questions above and defend your choices. Additionally, you'll often find that as you ask deeper and deeper questions, you need other data sets, need to combine data sets, or even need to throw away the data sets you were thinking of using and construct a whole new one. Being able to describe the thought process behind those decisions is incredibly useful because it shows how you think, which is something potential employers really want to know and see in action.

The next step

To that end, start practicing by asking these six questions about problems you think employers want to solve. Find a few job postings and go through the exercise as if you were already working there. This will be good practice for you.

Lastly, the next time you start working on a data science project for your portfolio, remember to ask whether it's worth using the top down approach by starting with the problem and questions.

Good luck!
