You find yourself overwhelmed with all the various tools mentioned in data science forums. You were hoping MATLAB would be enough, but you are inundated with advice on how to get started. You get advice to learn and practice R. Then you should learn Python. Then you should learn SQL. Then you should learn SPSS. Then you should learn Excel. Then you should learn RapidMiner. Then you should learn OpenRefine. Then you should learn IPython. Then you should learn D3.js. Then you should learn Hadoop. Then you should learn Data Wrangler. Then you should learn Pandas. Then you should... It just doesn’t stop.
You really want to get a data science job and play with data all day long. You absolutely love math, but are not sure academia is right for you for several reasons. So you’ve read some articles and gone through some papers and suddenly find yourself ridiculously excited about machine learning and about how to apply it in a business setting. Data science is now more than just a way out of academia; you find the whole process interesting as well. So you want to learn the tools that will make you hirable as a data scientist, but aren’t quite sure where to begin.
What tools do employers want data scientists to know?
Since your goal is to get hired as a data scientist, one concrete way to understand what hiring managers are looking for is to ask them. Since you are new, this can be a bit tougher than it looks. So the second best way is to look at data science job advertisements to see what tools are listed.
The field of data science is so big right now that the tools different companies and groups use vary significantly. Some data scientists mostly build data cleaning services. Some data scientists do academic-style research. Some data scientists do a mix of all of the above to varying degrees. Before drilling down into all the various types of data science roles that exist and the specific tools that they use, we’ll do a brief survey in order to get a sense of all the possible tools that are mentioned in relation to working as a "data scientist" in the industry.
When in doubt, look at the data
To better understand what data scientists get hired to do, here’s the plan. We’re going to go to CareerBuilder (a career website) and look at the first 2 pages of search results for the keyword "data scientist". This will cover 50 job postings. For each listing, we’ll go into it and figure out which tools it mentions. Then we’ll put together a list of the tools that appeared and sort it by the number of times each specific tool or technology was mentioned. Note that the results may differ by the time you read this, as this search is being done today (December 17, 2014).
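The tallying step described above can be sketched in a few lines of Python. This is only an illustration, not the method used to produce the numbers in this article: it assumes the text of each job posting has already been saved as a string, and the names `TOOLS`, `mentioned_tools`, and `count_tools` are hypothetical. Each tool is counted at most once per posting, and the matching is deliberately strict so that "SQL" is not counted inside "NoSQL" and "C" is not counted inside "C++".

```python
import re
from collections import Counter

# Hypothetical tool list for illustration; extend it with whatever you want to track.
TOOLS = ["R", "SQL", "Python", "Hadoop", "SAS", "Java", "Hive", "C++", "C", "NoSQL"]


def mentioned_tools(text, tools=TOOLS):
    """Return the set of tools that appear as standalone tokens in one posting."""
    found = set()
    for tool in tools:
        # Reject matches that are glued to letters, digits, '+' or '#',
        # so "SQL" does not match inside "NoSQL" and "C" does not match "C++".
        pattern = r"(?<![A-Za-z0-9+#])" + re.escape(tool) + r"(?![A-Za-z0-9+#])"
        # Single-letter names such as "R" are matched case-sensitively.
        flags = re.IGNORECASE if len(tool) > 1 else 0
        if re.search(pattern, text, flags):
            found.add(tool)
    return found


def count_tools(postings, tools=TOOLS):
    """Count, for each tool, how many postings mention it at least once."""
    counts = Counter()
    for text in postings:
        counts.update(mentioned_tools(text, tools))
    return counts


if __name__ == "__main__":
    # Made-up snippets standing in for real job postings.
    sample_postings = [
        "Experience with R, SQL and Hadoop required.",
        "Strong Python or R skills; NoSQL a plus.",
    ]
    # Print the tally sorted from most to least mentioned.
    for tool, n in count_tools(sample_postings).most_common():
        print(f"- {tool} x {n}")
```

Simple keyword matching like this will still miss variants (for example "Apache Hadoop 2" or "MS Excel" spelled differently), so it is worth spot-checking a few postings by hand.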
Note: though we are only going to use CareerBuilder, here is a list of recommended job websites you should take a look at...
- http://www.careerbuilder.com (what we’re using in this article)
Data science job tools
The URL that we’ll use is the following one => CareerBuilder Data Science Jobs and the search will be for "data scientist".
Here are the tools and number of times they showed up:
- R x 30
- SQL x 27
- Python x 22
- Hadoop x 19
- SAS x 18
- Java x 15
- Hive x 13
- MATLAB x 12
- Pig x 11
- C++ x 9
- Ruby x 9
- SPSS x 9
- Perl x 8
- Tableau x 8
- Excel x 6
- NoSQL x 5
- AWS x 4
- C x 4
- HBase x 4
- Bash x 3
- Spark x 3
- ElasticSearch x 2
- PHP x 2
- Scala x 2
- Shark x 2
Data scientists know and master every tool!
As you can see, data science job descriptions ask data scientists to know a long list of tools (25 in the list above), all the way from data technologies, to scripting languages, to statistical programming languages. And this was just in 50 job postings (2 pages of CareerBuilder results). Some tools are very similar to one another, and others are very specific to certain domains. This is one of the fortunate or unfortunate things about the data science field at the moment: it is so big right now that what matters, and what you’d actually use, differs drastically from job to job.
The silver lining behind this list is that most job postings have the following phrase: "know at least one of the following...". This means that you don’t actually have to go out and learn all of the tools. It just means that you should know at least one of them really well and have a passing familiarity with some of the other ones. You don’t need to know them intimately, you just need to know what they do.
So, if you are looking for a data science job, based on this data, the best way to get started is to learn R, SQL, and Hadoop. Then have a passing understanding of Python and the tools that work with Hadoop like Hive, Pig, and others. This will make it so that you know at least one of the tools that data science positions are looking for and you’ll have a good start to becoming a data scientist.
To get started with R, here’s a good introductory tutorial from Google => Intro to R by Google Developers