Statistics, Programming, & Data Science


How to shape your skill set to be a good data scientist

I've been doing data science since 2005. It wasn't called that when I started. I've been doing data mining since 1995. That term evolved during the 1980s, when I was in college. They are very different tasks, but there is a lot of overlap. I would say that, as of today, the common understanding of what a data scientist is, what data science is, and what expectations are reasonable for both tends to be misinformed. When you look at the way recruiters look for data scientists, or the way companies think they will use data scientists, you notice widely differing opinions about the job and understandings of it. I hope that in writing about this topic I can bring some clarity to people interested in becoming data scientists, as well as to those looking to institute data science in their business or to recruit people capable of doing data science.

My opinion is that certain aspects of data science define what a data scientist is. This is not a canonical list. I won't tell anyone they are doing it wrong, however they go about data science. It will, however, indicate whether an approach is what I consider to be correct. I believe that, at its most basic level, the purpose of data science is to discover hidden and meaningful relationships in data. When I say hidden, I mean a lot of things. Often I mean that basic summary information does not suffice, and that it takes some processing to discover the truth. I might also mean that the relationship can only be inferred, not directly measured.

Meaningful relationships are relationships whose knowledge empowers effective action. These relationships are not spurious, as happens in some approaches. A spurious relationship would be a meaningless correlation, as seen in the example illustrated above. This is one of the most important aspects of data science: getting it right. If the science doesn't get it right, then the decisions that come from the model may produce no result, or negative results.
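To make the idea of a spurious relationship concrete, here is a minimal sketch in Python. It is my own illustration, not a recipe: the series are pure random noise, and the names (`pearson`, `target`, `candidates`) and the sizes chosen are assumptions made for the example. It shows that if you compare one short series against enough unrelated series, at least one will correlate strongly by chance alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)   # fixed seed so the run is repeatable
n_points = 10             # assumption: a short series, e.g. ten yearly values
n_candidates = 1000       # assumption: number of unrelated series we "dredge"

# Every series is independent noise, so no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_candidates)
]

correlations = [pearson(target, c) for c in candidates]

# The strongest correlation found by searching, versus a typical one.
best_r = max(correlations, key=abs)
typical_r = sorted(abs(r) for r in correlations)[n_candidates // 2]

print(f"strongest correlation found: {best_r:+.2f}")
print(f"typical |correlation|:       {typical_r:.2f}")
```

The strongest correlation looks impressive, yet acting on it would accomplish nothing, because it was found by searching rather than by testing a hypothesis on fresh data.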

The three skill sets a data scientist must develop are:

  • Programming skills (some consider these optional, but I do not)
  • Statistical knowledge
  • Scientific literacy

I don't even mention machine learning and AI. While those are important tools, I consider them a subset of programming skills because without good programming skills you cannot effectively leverage those tools. I consider them important enough to place programming skills at the top of the list. If you are going to be doing mission-critical analysis, is your software mission critical? Anyone can do single-processor programming. When you have hundreds of billions of data points to process, knowing how to write parallel programs can come in handy. The quality of your models may depend a lot on the quality of the code you write.

Statistical knowledge is very important. Arguably, data mining starts with Thomas Bayes' paper comparing current probability to past probability. It was published posthumously in 1763. You need to know summary statistics, distribution models, statistical functions, and statistical inference, and you need analytical skills. Understanding the central limit theorem, for example, or why the median and the average are useful in different situations, is key. Probability is a very complex and technical subject. A bad assumption can ruin an entire model. Knowing how to check whether a statistical assumption is appropriate is a key skill.
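As a small illustration of those two points, here is a sketch using only Python's standard library. The income figures and the choice of an exponential distribution are invented for the example, and `sample_means` is a hypothetical helper. The first part shows how a single outlier pulls the average but barely moves the median. The second part shows the behavior the central limit theorem describes: means of larger samples cluster more tightly, with a spread close to sigma / sqrt(n).

```python
import random
import statistics

# Part 1: average vs. median on skewed data (made-up incomes, in thousands).
incomes = [40, 42, 45, 47, 50, 52, 55, 1000]
mean_income = statistics.mean(incomes)      # 166.375, dragged up by the outlier
median_income = statistics.median(incomes)  # 48.5, close to a "typical" value

print(f"mean = {mean_income}, median = {median_income}")


# Part 2: central limit theorem, sketched by simulation.
def sample_means(sample_size, trials, rng):
    """Draw `trials` samples from a skewed (exponential) distribution and
    return the mean of each sample."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(trials)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
means_small = sample_means(2, 2000, rng)
means_large = sample_means(50, 2000, rng)

# The exponential distribution with rate 1 has standard deviation 1, so the
# spread of the sample means should be roughly 1 / sqrt(sample_size).
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)

print(f"spread of means, n=2:  {spread_small:.3f} (theory: {1 / 2 ** 0.5:.3f})")
print(f"spread of means, n=50: {spread_large:.3f} (theory: {1 / 50 ** 0.5:.3f})")
```

A simulation like this is also a cheap way to check an assumption before building it into a model: if the observed spread were far from what theory predicts, the assumption about the underlying distribution would deserve a second look.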

The final skill I list is scientific literacy. This is the bedrock knowledge one must possess. How do you know what you know? Why do you think what you know is right? In my opinion, this is what science is for. How do you evaluate evidence? How do you know you have enough evidence of a high enough quality? Understanding scientific practices will go a long way in helping you be confident of your results.

When a company is looking for data scientists, it should make sure it knows why it is looking for a data scientist. If it's because you want your visualizations to look better, hire a graphic designer. If it's because you want to know how to be more efficient in your widget sales process, you probably just need a business intelligence analyst. A data scientist or, even better, a data science team can disrupt your company and the industry you are in. They can find new ways of doing business. They can find patterns that lie underneath your current understanding, make predictions, and, if they are really good, build prescriptive models. Not every company needs data science. Few do, actually, even if they are innovating. Not every problem is well suited to data science approaches, either.

When you are looking at a candidate data scientist, it is important to evaluate their strengths in programming, statistics, and science. A star programmer who has done several machine learning tutorials and who has heard of statistics will have as many weaknesses as a star statistician who was able to program a ggplot2 graph using R. It takes time to shore up the skills needed. An entry-level data scientist should know enough to describe cleaning data and applying business objectives to data analysis, understand what sample sets are, know basic statistics, and be agile at programming. You can do data science in just about any programming language. Knowing more than one language might be an asset, as long as the candidate is proficient in at least one of them. As a data scientist rises in seniority, deeper knowledge in each of the three disciplines is needed. Candidates, and those seeking them, need to be aware of what is expected and necessary for success. A well-educated hiring manager is just as important as a well-educated candidate data scientist.

Those are my thoughts. I'd love to respond to any feedback. What skills do you think are vital? What purposes do you feel are best suited for data science?

If you are going to JuliaCon 2018 in London, UK, leave me a note. I'd love to meet anyone there who is interested in data science.