Types of data scientist

A couple of days ago, I read with a lot of interest this O'Reilly report, Analyzing the Analyzers, by Harlan Harris and co-authors. It's a quick and pleasant read, and definitely worth looking at.

Considering the many news articles I see mentioning big data, the authors make a sobering observation on this subject (p16-17):

"Most data scientists rarely work with terabyte or larger data… True big data work seems limited to a relative small subset of data scientists."

Since it was a self-selecting survey, there is the possibility of bias to consider, but the drop-off between the percentage of respondents regularly working with gigabyte data sets and the percentage regularly working with terabyte data sets is staggering. The survey was also completed around a year ago, so trends may have changed since then, especially in the fast-moving tech world.

What the report highlighted for me was the diversity of skills that come under the umbrella of "data scientist". The authors classified data scientists as having one of five specialisms: business, machine learning/big data, maths/operational research, programming, and statistics, all of which are very different career paths.

Related to this, I recently read a couple of posts on Ryan Swanstrom's Data Science 101 blog (worth a look; there are some useful links there) stating that "data science is more than just statistics". He concludes that taking a data science course is a better decision than taking statistics if you want to forge a data science career.

Ryan distinguishes the two paths in terms of how they handle data. Data scientists consider existing data and find ways to do something with it, whereas statisticians may be involved in well-designed experiments from the outset.

Although that's quite a nice distinction, I think there's a little more to it than that. A statistician may not know much about extracting data from different information sources, storing it and accessing it. On the other hand, a data scientist may be adept at these skills, but might not have the same depth of statistical knowledge.

I certainly think that the topics covered in a data science course are going to be fairly widely applicable to several kinds of tech jobs, and will probably give a good grounding in several disciplines. One issue is that these courses are only just launching, so it will be a while before anyone can really see how effective they are at getting graduates into employment, and how they affect long-term career prospects, compared with studying subjects such as statistics or computer science directly.

Of course, even if you specialise, it is always possible to acquire some of the other necessary skills outside of the classroom. If you're training to be a statistician, there's nothing to stop you learning programming in your own time, or using tools like R or pandas for your work.

Returning to the report, the authors emphasise the need for a broad knowledge base (p19), which might imply that a more general data science course is a great idea. However, their very next point is that, in their opinion, a strong specialism, "be it statistics, big data, or business communication", makes for the best data scientists.

Data science is more than just statistics, but being a specialist in some field is not necessarily a disadvantage either (particularly if most of your peers have data science qualifications and you have something unique to offer).

In my case, after spending so long in education already, it would be difficult for me to justify taking more time out from employment to take one of the several data science courses that are springing up in the UK. So, for me, a final encouraging finding of this report is that one of the main routes to a data science career is from academic scientific research, which fits with DJ Patil's view from several years ago that "the best data scientists tend to be 'hard scientists'".