What is Data Science and (closely related) what is a Data Scientist?

I came across an interesting read recently on the definition of both data scientist and data science. Now, even though I’m about to disagree with almost everything in the article, that doesn’t mean I think it’s wrong-headed or not worth a read. It’s a fairly conventional, industry-standard view of the world and provides a common-sense and reasonable set of definitions for both data scientist and data science. I’d encourage you to take a look if you’re interested in this type of question.

Meanwhile, if you’re willing to rely on my summary, here’s what I take to be the gist of the article:

  1. Data Science is about finding insights in data to make better decisions
  2. Data Scientists bring to bear three primary skills: subject matter expertise, programming and data manipulation skills, and statistical knowledge to find those insights.
  3. Using survey techniques and asking data professionals to classify their skills, there are four major styles of data scientist. Three styles (business management professionals, developers, and researchers) map directly to the three key skills elaborated above (subject matter expertise, programming and statistics). Then there’s a fourth category appropriately titled “Creatives” who aren’t good at any of these skills…okay I jest…perhaps it’s more fair to say they are balanced fairly equally across the skill sets.
  4. Popular analytics methods (SMART and CRISP-DM) are essentially no more than variants of the “Scientific Method” and, when you get right down to it, data science is nothing more (or less since the diminutive is not meant to imply anything) than the application of that method to whatever problem a data professional is trying to solve. In other words, and here I quote directly, “data science just is science”.
  5. Science works via the “Scientific Method” described as:
    1. Formulate a question or problem statement
    2. Generate a hypothesis that is testable
    3. Gather/Generate data
    4. Analyze data to test the hypotheses / Draw conclusions
    5. Communicate results to interested parties or take action

That’s it. And you’re probably wondering how or why I would disagree with any of this since it’s pretty innocuous stuff. Yes, I’ve written in the past about my suspicions around the whole ‘data science’ term – though heaven knows I use it myself since the market seems to reward it. Taken as it generally is, it’s either a cunning replacement for the label statistician (since we all “know” statisticians aren’t much use when it comes to driving business value) or a demand that analysts should have “full-stack” skills. I don’t necessarily buy the idea that full-stack skills are critical or that there’s a huge benefit in combining them in a single person instead of spreading them across a team, but it’s not something I lose sleep over.

What’s more, once you start flavoring data scientists based on their real proficiencies inside that three-part set, you’re really just back to having analysts (the subject matter expertise folks), programmers, and statisticians. The same people you always had except now they call themselves data scientists and charge you quite a bit more for doing the same stuff they’ve always done. Since I’m one of those people, I’m not deeply opposed to the whole trend. Here’s a way to think about all this that I think is a little more useful.

None of which is really worth bothering to disagree about though. It’s semantics of a fairly uninteresting sort.

No, what really bothers me about this conventional view is encapsulated in the last two claims, #4 and #5: the idea that data science is science and that the scientific method is applicable to business analytics. I’m not at all sure that business analytics is or should aspire to be science and I’m quite sure that the scientific method won’t save us.

On the other hand, I agree with the first part of the claim in #4. Namely, that methodologies like CRISP-DM are just faintly warmed over versions of the scientific method.

Despite what most people would assume, that’s not a good thing and here I’m going to go all “philosophy guy” on you to explain why, and also why I think this is actually a pretty important point.


Debunking the Scientific Method

In the past five hundred years, the dominant theme in Western culture has been the continuing and astonishing success of the scientific endeavor. Only the most hardened skeptic could doubt the importance and success of scientific disciplines like physics, chemistry and biology in dramatically improving our understanding of the natural world. When it comes to the success of the scientific endeavor, I’m not skeptical at all. It’s worked and it’s worked amazingly well.

But why is that?

The popular conception is that science works because scientists apply the scientific method – testing theories experimentally and proving or refuting them. It’s the five step process enumerated above.

And it just isn’t right. Since way back in the day when I was studying philosophy of science, there’s been a broad consensus that the “scientific method” is a deeply flawed account of the scientific endeavor. Karl Popper provided the best and most influential account of the traditional scientific method and the importance of refutation as opposed to proof. Thomas Kuhn pretty much debunked that explanation as an historical account of how science actually works (despite having his own deeply unsuccessful explanation) and Quine absolutely destroyed it as an intellectual model. It turns out that it’s basically impossible to refute a single hypothesis in isolation with an experiment. Quine actually influenced my thinking on why KPIs, taken in isolation, are always useless. Depending on the background assumptions, any change of a KPI (and in any direction) can have diametrically opposed meanings. It’s pretty much the same thing with a hypothesis. You can rescue any hypothesis from experimental refutation by changing the background assumptions. What’s more, Kuhn showed that this happens all the time in science – punctuated by dramatic cases where it doesn’t.

I doubt there is a single working historian or philosopher of science who would accept the “scientific method” as a reasonable explanation for how science works from either an historical or intellectual perspective.

What’s more, the scientific method as popularly elaborated is almost contentless. Strip away the fancy language and it translates into something like this:

  1. Decide what problem you want to solve
  2. Think about the problem until you have an idea of how it might be solved
  3. Try it out and see if it works
  4. Repeat until you solve the problem

Does this feel action guiding and powerful?

It feels to me like the sort of thing you might sell on late-night TV. Available now, limited time only – a one stop absolutely foolproof method for solving any problem of any sort in any field! The Scientific Method! Buy!

The only part of the scientific method that feels significant in any respect is that requirement that your idea should be capable of specific refutation (testable) via experiment. Sadly, that’s exactly the concept that Quine showed to be impossible. So the scientific method as popularly understood is pretty much a bunch of boilerplate with one mistaken idea bolted on.

The idea that this type of general problem solving procedure is the explanation for the success of science seems implausible on its face and is contradicted by experience.

Implausible because the method as described is so contentless. How do I pick which problems to tackle from the infinite set available? The method is silent. How do I generate hypotheses? The method is silent. How do I know they are testable? The method is silent. How do I test them? The method is silent. How do I know what to do when a test doesn’t refute a hypothesis? The method is silent. How many failures to refute a hypothesis are enough to prove it? The method is silent. How do I communicate the results? The method is silent.

If what we want in a methodology is a massively generalized process that provides zero guidance on how to accomplish the tasks it lays out and has one impossible-to-meet demand, then the scientific method is great.

Hence the implausibility of the claim that the scientific method is a reasonable explanation for why science works. The scientific endeavor is neither defined, nor described, by the scientific method.

On a less important note, I’m not at all sure that it’s correct to think of data science as even potentially a scientific endeavor – at least when it comes to business analytics. The belief that the scientific endeavor works in general is broadly contradicted by experience – it doesn’t work for everything. Yes, the scientific endeavor has worked extraordinarily well in physics and biology. But smart people have tried to emulate the scientific approach in lots of other places too. Fields like history, sociology, philosophy and psychology (and lots of other disciplines as well) have all drunk the “scientific method” moonshine with a conspicuous absence of success. Clearly something about the scientific endeavor makes it very effective for some types of problems and not effective at all for others. That seems to me a pretty important fact to keep in mind when we claim that business analytics and data science are “just science”. It’s comforting to think we can re-cast business as science, but it’s not clear why we should think that’s true. I’ve never thought of business analytics as a truly scientific enterprise and renaming it data science doesn’t make it seem any more likely to be so.


Why CRISP-DM and most other generalized analytics models are the scientific method…and LESS

Unfortunately, methods specific to analytics like CRISP-DM are worse, not better. They lack even the idea of specific testability which, though incorrect, at least made some sense as a driver of a method. CRISP-DM lays out a process for analytics that essentially says it works like this: figure out what your problem is, figure out what data you need, set up your data, build your model, check your model, deploy your model.

Wow. That’s very helpful.

Here’s a CRISP-DM-like method for becoming President of the United States.

  1. Decide which political party to join
  2. Register as a candidate for president
  3. Create lots of positive press about yourself and your positions
  4. Raise a lot of money
  5. Convince people to vote for you

Armed with a cutting-edge method like this, your path to power is assured. Donald Trump beware!

Really, how different is CRISP-DM from this? It adds a few little flourishes and some academic language but it lives at the same level of empty generality. I suppose it’s good to know that you deploy models only after you build them, but I’m thinking a formal methodology should give us a little more utility than that.

Methodologies like Six Sigma or SPEED (which I laid out last week and which is why this topic is much on my mind and seems important) provide something real and essential – they provide enough guidance to actually drive a process.

As a side note, I’d point out that successful methodologies are nearly always domain specific (SPEED is entirely specific to digital analytics and Six Sigma has been mostly successful in a very specific range of manufacturing production problems) for the simple reason that generality destroys utility when it comes to method.


So is Business Analytics a “Science”?

It’s a real question, then, whether business analytics can reasonably be considered a science and, in fact, it’s a much more ambitious claim than most people would realize (at least when it’s cloaked in the idea that data science is a science – after all, it says science right there in the title). I’m highly skeptical of the idea that data science is science because I’m highly skeptical that business analytics problems are scientific problems.

They don’t seem like it to me. Business analytics problems map very poorly indeed to the natural sciences and only very partially to the social sciences where the track record of the scientific endeavor is, to say the least, mixed.

So claiming that data science is about using the scientific method on data problems might seem like a “Mom and Apple Pie” kind of thing, but I think it’s wrong on two counts.

It’s wrong because business analytics problems are not obviously the types of problems that are scientific. I can’t say for sure that they aren’t – and I might be persuaded otherwise – but at first glance I think there are strong reasons for skepticism and little reason to think that advocates of this view really understand what they are saying or have good reasons to back their claim.

It’s especially wrong because the scientific method as popularly understood is neither meaningful nor a method. This is important. In fact, this is the one really important thing you should take away from this post. If you think hiring data scientists ensures you have a method (and not just a method but a “scientific” one), you’re going to be sadly disappointed. Data scientists don’t arrive at your doorstep complete with a real method for continuous improvement in digital. It doesn’t matter how data sciencey they are. And if you believe that telling your analysts to use the “scientific method” is going to make your analytics more successful…well that, my friend, is even more absurd.

I have strong reasons for thinking that Six Sigma (for example) isn’t an appropriate methodology for digital analytics. But at least it’s a real method. Flawed as it is when applied to digital analytics, it’s rather more likely to drive results than the “scientific” method. And, of course, I have my own axe to grind. The methodology I described in SPEED is purpose-built for digital and is action-guiding. I’d love to have people adopt and use it. But even if you don’t like SPEED, the importance of having a real method and using that method to drive continuous improvement shouldn’t be discounted.

Go ahead, build your own. Just make sure it’s not of the “figure out your problem, then solve your problem, then iterate” variety; unless, of course, you want an analytics method to sell on late-night TV.


I promise there’s no (well…very little) philosophy in ‘Measuring the Digital World’ – but I do think there is some good method! It’s available for pre-order now on Amazon.