You are going to build a data science portfolio to showcase your skills. You will use it to attract potential employers, and as something to talk about during the actual interview. You've read the article How To Choose A Data Science Project For Your Data Science Portfolio and have a few project ideas. The question now becomes: what do you do next? This article will cover the basics of the bottom up approach - starting with the data.
Bottom Up versus Top Down
There are two ways to think about doing a data science portfolio project - bottom up and top down. Bottom up means that you start by thinking about the actual data set that is available. Top down means that you start by thinking about questions that you want answered. Both are equally valid and useful. Sometimes you'll find that a problem is more tractable with the top down approach, and other times you'll find that a problem is more tractable with the bottom up approach. Regardless of which approach you take initially, it's worth eventually thinking about the problem with both.
With the bottom up approach, you start by looking at the data set and trying to come up with things to investigate. You might even call it resources-then-questions. You have a resource of data, and now you need to come up with questions. You may think that it takes a good deal of creativity to come up with something useful. This is not the case. Once you have the data, there are only a few questions that you need to ask in order to set up the project.
What a data scientist is and does
"Data Scientist" is an industry term - that is, it's a term being used by companies to hire people to do data science. One of the bootcamps (Insight) describes a "data scientist" as follows:
"one job of a data scientist is asking the right questions on any given dataset (whether large or small)...After finding interesting questions, the data scientist must be able to answer them!"
So a data scientist, in hyper-simple terms, is someone who works for a company, asks the right questions, and then answers them.
A for-profit business is at all times trying either to make more money or to cut costs. So as a data scientist, given any data set, you should be able to come up with ideas for how that data could help with one or the other. Further, the way that a company makes money is by solving someone else's problem - so basically, by helping them.
The two questions to ask of your data set are as follows:
- How would you make someone money with it?
- How would you save someone money with it?
These may seem like rather simple and somewhat crude questions. They matter because, as noted above, they are being asked in the context of a for-profit business. The way a business makes money is by helping people do something they don't want to do themselves.
To continue to operate (unless the company is a VC-backed startup or Amazon), the company will constantly be striving to have positive profits. To have positive profits, a company will either have to make more money, have to cut costs, or do both. So by thinking of these two questions, you will have some good ideas of how to use the data set and what questions to ask of it.
Why this matters for a data science job
In a data science job, you will often find clients come forward with requests like "here's my data, what can we learn from it?". Regardless of what you eventually learn from it, or what recommendations or insights come out of working together, ultimately everyone in the company will be judged on whether they help the company move towards its stated goals (the industry it is in and what it is selling) as well as its intrinsic goals (make more money while spending less money).
Getting into this mind-set while putting together a data science portfolio will help you home in on what is important and what can be useful when looking at the data. Additionally, what you'll often find is that before you get to the modeling or heavy duty math, there are problems with the data that are causing issues today. Maybe it's missing data, maybe it's erroneous data, maybe it takes six days to download the data and you can decrease that to a couple of hours, or maybe the data needs to be made available to others in the company. Being able to describe the issues that can come up with data is very valuable, because data in the real world is frequently incredibly messy and nothing like the squeaky clean data sets found online in contests on Kaggle or elsewhere.
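As a rough illustration of spotting missing and erroneous data before any modeling, here is a minimal sketch in plain Python. The function name summarize_issues, the sample order records, and the valid ranges are all hypothetical and invented for this example; a real data set would need its own columns and rules.

```python
from collections import Counter


def summarize_issues(rows, required, valid_ranges):
    """Count missing and out-of-range values per column (illustrative only)."""
    missing = Counter()
    out_of_range = Counter()
    for row in rows:
        # A value counts as missing if the key is absent, None, or an empty string.
        for col in required:
            value = row.get(col)
            if value is None or value == "":
                missing[col] += 1
        # A value counts as erroneous if it falls outside its allowed range.
        for col, (low, high) in valid_ranges.items():
            value = row.get(col)
            if value is None or value == "":
                continue  # nothing to range-check
            if not (low <= value <= high):
                out_of_range[col] += 1
    return dict(missing), dict(out_of_range)


# Hypothetical sample records, made up for illustration.
orders = [
    {"order_id": 1, "price": 19.99, "quantity": 2},
    {"order_id": 2, "price": None, "quantity": 1},   # missing price
    {"order_id": 3, "price": -5.00, "quantity": 3},  # negative price
    {"order_id": 4, "price": 42.50, "quantity": ""},  # missing quantity
]

missing, out_of_range = summarize_issues(
    orders,
    required=["order_id", "price", "quantity"],
    valid_ranges={"price": (0, 10_000), "quantity": (1, 1_000)},
)
print("missing:", missing)
print("out of range:", out_of_range)
```

A summary like this is often enough to start a conversation about what the bad records are costing today, which ties back to the two questions above.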
The next step
To that end, to start practicing asking these two questions about data, go over to the datasets subreddit and pick one data set. Then ask yourself: How would you make someone money with it? And how would you save someone money with it? This will be good practice for you.
Lastly, the next time you start working on a data science project for your portfolio, remember to ask whether it's worth using the bottom up approach and starting with the data.