Data Storytelling, Part III: Knowing Your Subject and What Not to Say

How to do data storytelling - Part III

Suppose that right now, at the beginning of April, we tested the entire United States for the coronavirus that causes COVID-19. The test’s accuracy is 95%.

About 330 million test results come in – and a friend calls you to say that hers has come back positive.

How likely is it that she has the virus?

Estimate the odds, from zero to 100%. By the time this post is done, you’ll have a sense of how close you got to a reasonable estimate. 

More importantly, I hope you'll understand how crucial it is for us to choose very carefully what we communicate when we tell data stories. As I discussed in Part I of this data storytelling series, data stories communicate new insights about issues that may be poorly understood – unlike dashboards, which generally show status about things we already understand. Knowing what not to say is a key part of data storytelling.

Setting Limits to Prime Your Reader’s Intuition

Before we get there, though, let’s try a thought experiment. Imagine you’re taking a test for a nasty theoretical virus: one that was defined as a new RNA sequence, but which does not actually exist.

You’re told that the test is 99% accurate. If this virus were ever to be created, naturally or through biotech practices, this test would have a 99% chance of detecting it in a victim.

Amazingly, the test comes back positive.

Should you panic? Do you have the virus? Should you get a second opinion? What are the odds that you would actually have the virus they’re testing for?

The correct answers are, respectively, no, no, don’t bother, and zero.

Despite the test result, you don’t have this virus. Nobody on the planet does – it doesn’t actually exist.

And a positive test result isn’t so amazing, really: All medical tests sometimes show false positives. If you tested a million people for a virus that doesn’t exist, using a test with a 10% false positive rate, then a hundred thousand healthy people would get a positive result. You’re just one of the lucky 100,000.

Finally, consider this: If everyone on the planet had the virus, then any negative test result you got would be a false negative, too.

Add it all up, and it becomes clear that the odds of any given positive or negative result being accurate can range anywhere from zero to 100%, depending in part on how many people in the population actually have the virus.

In making this example, I’ve tried to clear away your objections before you even know what I expect you to object to. If I was successful, I’ve primed your intuition to accept the fact that a positive result might not mean you’re likely to have the virus – that in this case it’s not even possible for the result to be right – even if the test is “99% accurate”. And I hope I’ve primed you to accept the idea that this is all related to the number of people who actually have the virus, though I haven’t explicitly stated this yet.

Choosing What Our Stories Communicate

People have to make rational decisions in the light of the information they have. Unfortunately, some of the data we have (99% accurate test) can obscure the information that people need (0% likelihood of illness). Quarantining 100,000 people for a virus that doesn’t exist would be irrational – but people who want to be data-driven might make that mistake because they don’t understand what the data is actually telling them.

As data analysts, we need to make it easy for people to grok what’s going on. They don’t need to understand, or even see, every number.

We should avoid stating facts that hamper our readers’ understanding – even when those facts were important to us on our way to understanding their implications. In my first data storytelling post, I said, “analysts should regularly follow two distinct processes: One to attain the insights the business needs, and [data storytelling] for communicating those insights.” We shouldn’t use every data point from the first process as we execute the second one.

Let’s go farther down this rabbit hole by grappling with multiple issues: definitions, the expectations set when our readers see things like “99% accurate,” and what we choose to communicate about how we came to our conclusions.

Knowing Your Definitions

If you know medical tests (or Bayesian analysis), you might have balked at my statement that the test is “99% accurate.” Medical tests don’t generally have “accuracy.” They have sensitivity and specificity.

Sensitivity is the likelihood that, if a person has the virus, the test will correctly show a true positive: It’s the true positive rate. From the sensitivity, you can deduce the false negative rate: the odds of a person who has the virus getting a result that says she’s clean. This is what people usually mean when they say “accuracy,” but it’s not even half of the story.

Specificity is the likelihood that, if a person doesn’t have the virus, the test will correctly show a true negative: It’s the true negative rate. From the specificity, you can deduce the false positive rate: the odds of a person who doesn’t have the virus getting a result that says she’s sick.

For my examples, I’ll assume a sensitivity of 99% and a specificity of 90%. Each test would be different, but from my understanding, these are better than many of the COVID-19 tests have been up to this point.

It feels like we know enough to communicate our testing scenario, doesn’t it? Let’s put this into a chart.
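Here’s a minimal sketch of that chart in Python, built from the sensitivity and specificity above; the pandas layout and labels are my own choices:

```python
# A sketch of the 2x2 chart: columns are the true state, rows are the test result.
import pandas as pd

sensitivity = 0.99  # P(positive test | has the virus)
specificity = 0.90  # P(negative test | doesn't have the virus)

chart = pd.DataFrame(
    {
        "Has the virus": [sensitivity, 1 - sensitivity],          # true positive, false negative
        "Doesn't have the virus": [1 - specificity, specificity],  # false positive, true negative
    },
    index=["Positive test", "Negative test"],
)

print(chart)
print(chart.sum(axis=0))  # each column sums to 1, by definition
print(chart.sum(axis=1))  # the rows do not sum to 1
```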

Think about what this chart represents. You might get a good feeling from the fact that the columns add up to 1, but you shouldn’t. They do so by definition, so that doesn’t tell us anything new. What we’re looking for is the odds that a positive test result is a true positive vs. a false positive – and the rows don’t add up to 1. Nor is the sum of all four boxes equal to 1.

We can’t naively interpret the rows and columns as helping us solve for probabilities. There’s something more going on here. (Good thing the sensitivity and specificity are different, or we might have made a critical error!)

We’re missing a crucial factor, based on these definitions: We have to know (or estimate) the odds that the test subject does or doesn’t have the virus in the first place.

Getting the Full Answer

Let’s set boundaries again.

We’ll start with our theoretical virus example, where nobody can possibly have the virus, and let’s run our test against 1000 people. (Remember, we’re not talking about test results here. We’re estimating or asserting the actual number of people who really have the virus.) Then all 1000 cases fall within the second column (Doesn’t have the virus), so we will get 900 true negatives (1000 * .90) and 100 (1000 * .10) false positives.

If we start with a virus that everyone in the world has, then all 1000 cases fall within the first column (Has the virus). We’ll get 990 true positives (1000 * .99) and 10 false negatives (1000 * .01).

If we start with a virus that affects 50% of the population, then 500 cases fall within the first column and 500 cases fall within the second. Of the 500 cases in the first column, 99% (495) will get true positives and 1% (5) will get false negatives. Of the 500 cases in the second column, 90% (450) will get true negatives and 10% (50) will get false positives. Note that when you add up all the cases, you’ve covered all 1000 of your test subjects: 495 + 5 + 450 + 50.

When we use a population of 1000, each box represents the number of people who would fall into that category (true positive, false negative, true negative, false positive). We can instead use a population of 1, in which case each box represents the odds that any random person will land in one of those four boxes.

Our approach, then, is this: Multiply the first column by the estimated number (or proportion) of people who have the virus. (That proportion is called the prior probability of having it.) Multiply the second column by the estimated number (or proportion) of people who do NOT have the virus.
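Here’s a minimal sketch of that approach in Python, using the 1000-person scenarios above; the function name and defaults are my own:

```python
# Scale each column of the chart by the number of people in that column.
def outcome_counts(prevalence, population=1000, sensitivity=0.99, specificity=0.90):
    """Split a population into the four boxes, given the prior prevalence."""
    has_virus = population * prevalence
    no_virus = population * (1 - prevalence)
    return {
        "true positives": has_virus * sensitivity,
        "false negatives": has_virus * (1 - sensitivity),
        "true negatives": no_virus * specificity,
        "false positives": no_virus * (1 - specificity),
    }

print(outcome_counts(prevalence=0.0))  # nobody has it: 900 true negatives, 100 false positives
print(outcome_counts(prevalence=1.0))  # everyone has it: 990 true positives, 10 false negatives
print(outcome_counts(prevalence=0.5))  # half and half: 495, 5, 450, 50
```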

Okay, we’ve solved for the situations that we had in our thought experiment, and we’ve come to our general method. What about the coronavirus?

As I’m writing this, there are about 164,000 “confirmed” cases of COVID-19 in the US. (Confirmed how? By tests, of course. Is this a little circular? You bet it is. That’s one of the hazards you should explain – the certainties are lower than we’d like them to be.) Let’s estimate there are 100 times that number of virus carriers, or 16,400,000. The US population is about 330,000,000, so the proportion of people who have the virus (the left column) is 16,400,000 / 330,000,000 = 4.97%. The right column represents 95.03% of the population.

We’re almost there. We still want to help our friend. Given her positive test result, what are the odds that she has the virus?

She was in the top row. The odds of her having a true positive are the left box divided by the sum of both boxes: 0.0492 / (0.0492 + 0.0950) = 34%. It’s still more likely, by a factor of two to one, that she’s clean.

This is not symmetrical. If another friend calls you to say that he got a negative result, that’s fantastic news: He has a 99.94% chance of being clean, and only a 0.06% chance of having the virus.

By the way: If our estimate was that the true rate in the population was only 10 times the confirmed rate, then her odds of a true positive would have been only about 5%, with about a 95% chance of a false positive. More on that in a moment.

Check Your Results

So make your assumptions about the actual number of COVID-19 cases in America. Change assumptions about the sensitivity and specificity of the tests. Then run your numbers and see how well you guessed in the opening section. How did you do?
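If you want to run those numbers yourself, here’s a minimal sketch in Python; the function names are mine, and the priors are just the estimates used above, so substitute your own assumptions:

```python
# Bayes' rule applied to the test: weight each column by the prior, then
# ask what fraction of a given test result comes from the "true" box.
def p_virus_given_positive(prior, sensitivity=0.99, specificity=0.90):
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

def p_clean_given_negative(prior, sensitivity=0.99, specificity=0.90):
    true_negative = (1 - prior) * specificity
    false_negative = prior * (1 - sensitivity)
    return true_negative / (true_negative + false_negative)

confirmed = 164_000
population = 330_000_000

prior_100x = 100 * confirmed / population  # roughly 4.97% of the population
prior_10x = 10 * confirmed / population    # roughly 0.5% of the population

print(p_virus_given_positive(prior_100x))  # about 0.34, the 34% above
print(p_clean_given_negative(prior_100x))  # about 0.9994, the 99.94% above
print(p_virus_given_positive(prior_10x))   # about 0.05, the 5% above
```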

I guarantee that most healthcare professionals, and certainly the general public, did not guess well. They don’t understand everything I just went through, and we shouldn’t expect them to. They have lives to lead, and we’re the data wonks: It’s up to us to crunch the numbers and help them understand what they mean.

We have to help them make significant decisions – and that means data storytelling.

Back to Data Storytelling

I’ve gone through a ton of detail because you’re my audience and you’re more data-savvy than most other people. I haven’t been engaged in data storytelling, either: I’ve been making an argument. So let’s think about what we, as data storytellers, would say to your friend.

First, we should find out any important details that could change our analysis. Does she live in Manhattan? There’s a much higher proportion of people with the coronavirus there. Has she presented to a hospital with symptoms related to COVID-19? Clinical diagnostics might indicate a greater likelihood of the virus. Is she a healthcare worker who has been around other COVID-19 patients? That would also increase her odds.

But assuming she has no reason beyond the test result for believing that she has the virus, the conversation can be pretty straightforward.

Start in medias res: “Believe it or not, the odds are pretty low that you actually have the virus. I’d estimate somewhere between five and 35 percent.”

Give the smallest amount of relevant background you can: “These tests are really good at detecting that you don’t have the disease, but not so good at detecting when you do have it.” Maybe back that up with a simple chart.

Try to be compelling, but if you have to choose between clear and clever, choose clear. As Orson Scott Card has said about fiction, “Better a Duh… than a Huh?” 

Keep your scope limited. For the luvva Pete, don’t distract her with all of the stuff I ran through in this blog post. It doesn’t matter to her.

And give her information about how her choices affect her life, much of which doesn’t have anything to do with the analysis we’ve walked through in this post: “Look, 34% isn’t much, but it’s a lot more than zero. Isolate yourself, especially from your grandmother, as much as possible. Repeat the test if possible to confirm or refute it. Watch for symptoms, and call your doctor immediately if they arise.”

Know Your Audience

I made the storytelling suggestions above because your audience was your friend. Find the right messages to deliver to other audiences, such as policymakers. Here are two relevant examples:

  • Early testing of large, mostly clean populations has extremely little benefit. If we had tested all of America for the virus back in January or February, as many people were calling for, it would have been a colossal waste of time. Any true positives would have been swamped by false positives, hundreds or thousands of them for every real case.
  • Testing people who show no sign of having the coronavirus may well deplete the number of tests you have while providing you with more false positives than true ones. For as long as tests are at a premium, it’s worth reserving them for people who have a good clinical or social reason to believe that they have been exposed to the coronavirus. (I still hear people complaining about this.)

Finally, save all of your research. Whether it’s in a Jupyter notebook, a PowerPoint presentation, or a blog post, demonstrate the steps that you’ve taken to reach your conclusions – because someone is going to ask how you came up with your answers. Savvy people don’t often trust black boxes.

Conclusion

This post is much longer than usual, but I thought it was important to bring out all of the detail needed for this kind of argument. What do you think? Tell me in the comments.
