Don’t Ignore Bears: The Pitfalls of Summarizing Data with Medians

Some people are big fans of the median as a summary statistic. I sometimes am too. But it has some big downsides, as all statistics do, which I want us to be aware of.

Why do people suggest the median in the first place? Usually because the median ignores extreme values. If you’re hiring a new graphic designer for your startup, you may want to benchmark the salary against other salaries in your area. Assuming you could get the data, you would probably find a few really high salaries, from rock-stars working at big firms. But that’s not what you want to benchmark against: you want what’s more common. That idea of a “central tendency” is what the median can get you. The $100k guy is left out.

But this is bought at a cost: medians ignore outliers because they ignore every value in your dataset except that of the single central item (or the two central items when the number of values is even). Relative position is all that counts. This means you’re throwing away information every time you use the median (statistical “inefficiency”). Naturally, this can get you into trouble. Here are some examples.
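
To make the “relative position” point concrete, here is a minimal Python sketch of how a median can be computed by hand. The function name `median_by_position` and the salary figures are hypothetical, made up purely for illustration; they are not real benchmark data.

```python
def median_by_position(values):
    """Return the median by sorting and reading off the middle position(s)."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: one central item decides the result.
        return ordered[mid]
    # Even count: average the two central items.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical salaries in $k (made-up numbers, not real data).
salaries = [42, 45, 48, 50, 55, 60, 100]
inflated = [42, 45, 48, 50, 55, 60, 1000]  # only the top value changes

print(median_by_position(salaries))   # 50
print(median_by_position(inflated))   # 50
print(sum(salaries) / len(salaries))  # about 57.1
print(sum(inflated) / len(inflated))  # about 185.7
```

The top salary can grow tenfold and the median does not move, because only the item in the middle position is ever read. The mean, which uses every value, moves a lot.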

Failure Type 1: Small Range of Values

This happens when you have a small set of options near the mid-point of your data set. This is likely when you have discrete values (1,2,3 instead of 1.234, etc) that cover a narrow range. For instance, say you have 1–5 star ratings of a product. A few bad things can happen.

Bad Thing 1A: Exaggerate Small Differences

The two products here have almost identical ratings. But one rating — the central one — is different:

Product A: 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 5
Product B: 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 5
                             ^
                           Median

Looking at the median product rating, you will perceive the products to have values of 3 and 2. But the means (about 2.58 and 2.53) differ by only about 0.05 of a point.
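
A quick way to check this is with Python’s standard `statistics` module. This is only a sketch using the two rating lists above; the variable names are my own.

```python
from statistics import mean, median

product_a = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5]
product_b = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5]

# Gap between the two products under each summary statistic.
median_gap = median(product_a) - median(product_b)
mean_gap = mean(product_a) - mean(product_b)

print(median(product_a), median(product_b))                  # 3 2
print(round(mean(product_a), 2), round(mean(product_b), 2))  # 2.58 2.53
print(median_gap, round(mean_gap, 3))                        # 1 0.053
```

One changed rating out of 19 moves the median by a full star, while the mean moves by 1/19 of a star.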

Bad Thing 1B: Missing Changes

You will miss a real change if that change does not move the median value:

Product C: 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
Product D: 2 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 5 5 5
                             ^
                           Median

The median remains constant at 3, even though both ends of the distribution have shifted upward. The mean rises by about 18%, from roughly 2.9 to 3.5.
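
To see where the change hides, it can help to count how many ratings fall on each star value. The following is a small sketch with the standard library, using the two lists above; the variable names are my own.

```python
from collections import Counter
from statistics import mean, median

product_c = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
product_d = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5]

for name, ratings in (("C", product_c), ("D", product_d)):
    counts = Counter(ratings)
    # Number of ratings at each star value, 1 through 5 (0 if a value never occurs).
    per_star = [counts[star] for star in range(1, 6)]
    print(name, per_star, "median:", median(ratings), "mean:", round(mean(ratings), 2))
# C [2, 1, 12, 4, 0] median: 3 mean: 2.95
# D [0, 1, 11, 4, 3] median: 3 mean: 3.47

# Relative change in the mean, computed on the unrounded values.
relative_change = mean(product_d) / mean(product_c) - 1
print(f"{relative_change:.1%}")  # 17.9%
```

The 1-star ratings disappear and three 5-star ratings appear, yet the big pile of 3s in the middle keeps the median exactly where it was.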

Failure Type 2: Ignoring Extreme Values

People like to ignore extremes because they might be “outliers,” meaning errors. But extremes are a natural feature of most real-world systems, like salaries, plant densities, or disaster impacts. Often this means right-skewed data, with a few really large values and many small ones.

Extremes can greatly inflate the (arithmetic) average, but that’s not necessarily a bad thing: extreme things have an outsize impact in the real world too, and your analysis may need to reflect that. An ecologist should not ignore bears just because they’re very large!

If you use the median on skewed data, you’re ignoring rare extremes and instead focusing on what’s common. Whether this is a good idea depends on what you want to know. Imagine you run a maker space, and have data on the number of customer injuries per week:

Injuries: 1 1 1 1 1 1 1 1 2 2 2 3 3 5 8 16 40
                          ^
                        Median
Mean: 5.2

Mean and Median Answer Different Kinds of Question

Besides never allowing “battle robot Thursday” ever again, what can we say? It depends:

  • If we want to know “How many injuries can I look forward to this week?” then the median may work well. Assuming the past is a good predictor, you have at least a 50% chance of 2 or fewer injuries.
  • If we instead wonder, “How many more injuries will I get if I open for another two weeks?” then we need the (arithmetic) mean. The mean is a way of pretending all weeks are the same, while keeping the total constant. Most weeks have fewer than 5 injuries; but if every week did have about 5, the total would not change. And we can estimate another 5.2 x 2 = 10.4 injuries if we’re open two more weeks.
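
Both questions can be answered from the same weekly data. Here is a minimal sketch, assuming (as above) that past weeks are a fair guide to future ones; the variable names are my own. Note that the unrounded mean gives about 10.5 extra injuries, slightly more than the 10.4 you get from the rounded 5.2.

```python
from statistics import mean, median

injuries = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 8, 16, 40]

typical_week = median(injuries)  # answers "what should I expect this week?"
average_week = mean(injuries)    # answers "how do injuries add up over time?"

# Share of past weeks with no more injuries than the median week.
share_at_or_below = sum(x <= typical_week for x in injuries) / len(injuries)

# Expected extra injuries if we stay open for more weeks.
extra_weeks = 2
expected_extra = average_week * extra_weeks

print(typical_week)                 # 2
print(round(average_week, 1))       # 5.2
print(round(share_at_or_below, 2))  # 0.65
print(round(expected_extra, 1))     # 10.5
```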

When Do Extremes Matter?

Don’t just pick the number that’s most appealing: 2 injuries sounds better than 5, but is not a good way to plan ahead. Two quick questions can help you evaluate whether you need a mean instead of that median:

  • If the largest value were 10 times larger, would I care?
  • Am I actually asking, “What if every value were the same?”
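
Both questions are easy to try out on your own data. The sketch below uses the injury counts from earlier; the helper name `tenfold_check` is hypothetical, just something I made up for this illustration.

```python
from statistics import mean, median


def tenfold_check(values):
    """Compare median and mean before and after making the largest value 10x larger."""
    stretched = sorted(values)
    stretched[-1] *= 10
    return {
        "median": (median(values), median(stretched)),
        "mean": (mean(values), mean(stretched)),
    }


injuries = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 8, 16, 40]
result = tenfold_check(injuries)

print(result["median"])  # (2, 2)
before, after = result["mean"]
print(round(before, 1), round(after, 1))  # 5.2 26.4

# "What if every value were the same?": the mean times the count gives back the total.
print(round(before * len(injuries)), sum(injuries))  # 89 89
```

If a week with 400 injuries would change your decisions, the median is hiding something you care about, since it stays at 2 either way.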

Take-Away Message

We all need summary statistics: we can’t just present raw data, visually or in tables. Almost every regression analysis relies on them too. They aren’t going away. So our best option is to be better informed about the different statistics and when to use each one.

Consider whether you have a small range of values in your data, and whether extremes should matter. If so, the mean may be preferable to the median. Visualize your data. And don’t use any statistic without thinking about it!


Don’t Ignore Bears: The Pitfalls of Summarizing Data with Medians was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
