How a Cup of Tea Laid the Foundations for Modern Statistical Analysis

Scientific experiments run today are based on research practices that evolved out of a British tea-tasting experiment in the 1920s.

In the early 1920s, a trio of scientists sat down for a break at Rothamsted agricultural research station in Hertfordshire, UK. One of them, a statistician by the name of Ronald Fisher, poured a cup of tea, then offered it to his colleague Muriel Bristol, an algae specialist who would later have the alga C. muriella named after her. Bristol refused, as she liked to put the milk in before the tea. Fisher was skeptical. Surely it didn’t matter? Yes, she said, it did. A cup with milk poured first tasted better.

“Let’s test her,” chipped in the third scientist, who also happened to be Bristol’s fiancé. That raised the question of how to assess her tasting abilities. They would need to make sure she was given both types of tea, so she could make a fair comparison. They settled on pouring several cups, some tea-then-milk and others milk-then-tea, then getting her to try them one at a time. But there were still a couple of problems. Bristol might try to anticipate the sequence they’d chosen, which meant cups needed to arrive in a genuinely random order. And even if the ordering was random, she might get a few correct by chance. So there would need to be enough cups to ensure this was sufficiently unlikely.

Fisher realized that if they gave her six cups—three with milk first and three with milk second—there were 20 different ways they could be randomly ordered. Therefore, if she simply guessed, one in 20 times she’d get all six correct. What about using eight cups instead? In this situation, Fisher calculated there were 70 possible combinations, meaning there was a one in 70—or 1.4 percent—probability she’d get the sequence right by sheer luck. This was the experiment they decided to run with Bristol. They poured eight cups, four of each type, and got her to test them in a random order. She named the four she preferred, and the four she disliked, then they compared her conclusions with the true pattern. She’d got all eight correct.
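
To see where those numbers come from, here is a minimal Python sketch (my own illustration, not anything Fisher wrote down) that reproduces the counting with binomial coefficients.

```python
from math import comb

# Number of ways to choose which half of the cups are "milk first".
six_cup_orderings = comb(6, 3)    # 20 possible arrangements
eight_cup_orderings = comb(8, 4)  # 70 possible arrangements

# If the taster is purely guessing, every arrangement is equally likely,
# so the chance of labelling every cup correctly is 1 over the number of arrangements.
print(f"Six cups:   1 in {six_cup_orderings} ({1 / six_cup_orderings:.1%})")
print(f"Eight cups: 1 in {eight_cup_orderings} ({1 / eight_cup_orderings:.1%})")
```

With six cups a lucky guess succeeds 5 percent of the time; with eight, only about 1.4 percent.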

Bristol’s success ultimately came down to chemistry. In 2008, the Royal Society of Chemistry reported that tea-then-milk gives the milk a more burnt flavour. “If milk is poured into hot tea, individual drops separate from the bulk of the milk and come into contact with the high temperatures of the tea for enough time for significant denaturation to occur,” they noted. “This is much less likely to happen if hot water is added to the milk.”

Fisher later described the tea-tasting experiment in a 1935 book titled simply The Design of Experiments. Among other things, the book summarized the crucial techniques they’d pioneered in that Rothamsted tea room. One was the importance of randomization; it wouldn’t have been a rigorous test of Bristol’s ability if the ordering of the cups was somehow predictable. Another was how to arrive at a scientific conclusion. Fisher’s basic statistical recipe was simple: start with an initial theory—he called it the “null hypothesis”—then test it against data. In the Rothamsted tea room, Fisher’s null hypothesis had been that Bristol couldn’t tell the difference between tea-then-milk and milk-then-tea. Her success in the resulting experiment showed Fisher had good reason to discard his null hypothesis.

But what if she’d only got seven out of eight correct? Or six, or five? Would that mean the null hypothesis was correct and she couldn’t tell the difference at all? According to Fisher, the answer was no. “It should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation,” he later wrote. “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” If Bristol had got one or two wrong, it didn’t necessarily mean she had zero ability to distinguish milk order. It just meant the experiment hadn’t provided strong enough evidence to reject Fisher’s initial view that it made no difference.

If Fisher wanted experiments to challenge null hypotheses, he needed to decide where to set the line. Statistical findings have traditionally been deemed “significant” if the probability of obtaining a result that extreme by chance (i.e. the p-value) is less than 5 percent. But why did a p-value of 5 percent become such a popular threshold?

It came down to a combination of copyright and convenience. In a 1908 paper, the statistician William Sealy Gosset had investigated how randomness in data could influence data analysis, and the paper contained pages of statistical tables quantifying that influence. Fisher was keen to draw upon this research, but was cautious about lifting the copyrighted tables directly. So instead he reframed them, and found that a threshold suggested by the work—a p-value of around 4.6 percent—lined up neatly with some calculations he had already been doing. It was easy to round it up to 5 percent.

When Muriel Bristol picked those cups, there was a 1.4 percent chance she would get that many correct. In Fisher’s eyes, this provided “significant” evidence that his null hypothesis was wrong. As he would later put it, a p-value below 5 percent meant “either an exceptionally rare chance has occurred or the theory is not true.”

The statistical comparison used in that tea-room experiment would become known as “Fisher’s exact test,” but not everyone was convinced Fisher had got his approach exactly right. In his experiments, Fisher was interested in testing whether the null hypothesis was wrong, not in deciding which hypothesis was correct. Suppose Muriel Bristol had got a couple of cups wrong. On balance, should we conclude that she couldn’t tell the difference? Or that she could? As we’ve seen, Fisher’s test dodges making a choice in this situation; it doesn’t come to any conclusion.
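
For readers who want to reproduce the tea-room comparison with modern tools, the sketch below is one way to do it (my choice of library, not Fisher's original calculation): it feeds the 2×2 table of true pouring order versus Bristol's labels into scipy.stats.fisher_exact.

```python
from scipy.stats import fisher_exact

# Rows: cups that were truly milk-first / tea-first.
# Columns: cups labelled milk-first / tea-first by the taster.
# A perfect score on eight cups (four of each type) gives this table.
table = [[4, 0],
         [0, 4]]

# One-sided test: how likely is a result at least this good under pure guessing?
_, p_value = fisher_exact(table, alternative="greater")
print(f"p-value = {p_value:.3f}")  # about 0.014, i.e. 1 in 70
```

The p-value of roughly 0.014 is the same 1-in-70 figure Fisher worked out by hand.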

Statisticians Jerzy Neyman and Egon Pearson (the son of Karl Pearson, who first coined the p-value) didn’t think this was good enough. If they started with two hypotheses—such as whether someone can or can’t tell the difference between cups of tea—they didn’t want a method that refuses to choose. According to Neyman and Pearson, researchers need a way to decide which hypothesis to accept and which to reject.

This decision-based attitude to statistics is analogous to the approach taken in legal cases. Much like legal decisions, Neyman and Pearson’s approach requires us to decide on the burden of proof: faced with a particular piece of evidence, how skeptical should we be? If we’re easily persuaded, we’ll end up accepting many hypotheses, whether or not they’re true. In contrast, if we set the bar for evidence very high, we’ll throw out most hypotheses that are false, but also disregard many that are true.

To deal with this trade-off, Neyman and Pearson introduced two concepts that would go on to plague statistics students: type I and type II errors. The first occurs when we incorrectly reject a hypothesis that is true; the second happens when we incorrectly accept one that is false.

Consider the Blackstone ratio, which suggests that it’s better to have 10 guilty people incorrectly released than have one innocent person imprisoned. In essence, the ratio is saying that, when it comes to criminal justice, the chance of a type I error should be 10 times smaller than the chance of a type II error. In medical studies, a ratio of 4 is commonly used instead: the popular threshold for a type I error is a probability of 5 percent (thanks to Fisher) but 20 percent for a type II error. We don’t want to miss a treatment that works, but we really don’t want to conclude that a treatment works when it doesn’t.
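
To make the two error rates concrete, here is a rough Python simulation (the sample size, effect size, and trial count are invented for illustration, not taken from the text) that estimates how often a standard two-sample t-test commits each type of error at the conventional thresholds.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Invented scenario: 64 patients per arm, outcomes on a standardised scale,
# and a genuine treatment that shifts the average outcome by 0.5.
n, effect, trials, alpha = 64, 0.5, 5_000, 0.05

type_i = type_ii = 0
for _ in range(trials):
    placebo = rng.normal(0.0, 1.0, n)

    # Type I error: no true effect, yet the test declares significance.
    useless_drug = rng.normal(0.0, 1.0, n)
    if ttest_ind(useless_drug, placebo).pvalue < alpha:
        type_i += 1

    # Type II error: a true effect exists, yet the test misses it.
    useful_drug = rng.normal(effect, 1.0, n)
    if ttest_ind(useful_drug, placebo).pvalue >= alpha:
        type_ii += 1

print(f"Estimated type I error rate:  {type_i / trials:.1%}")   # close to 5%
print(f"Estimated type II error rate: {type_ii / trials:.1%}")  # close to 20%
```

With these particular numbers the simulation lands near the conventional 5 percent and 20 percent levels, which is where the ratio of 4 comes from.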

Fisher did not take Neyman and Pearson’s criticisms well. In response, he called their methods “childish” and “absurdly academic.” In particular, Fisher disagreed with the idea of deciding between two hypotheses, rather than calculating the “significance” of available evidence, as he’d proposed. Whereas a decision is final, his significance tests gave only a provisional opinion, which could be later revised. Even so, Fisher’s appeal for an open scientific mind was somewhat undermined by his insistence that researchers should use a 5 percent cutoff for a “significant” p-value, and his claim that he would “ignore entirely all results which fail to reach this level.”

Acrimony would give way to decades of ambiguity, as textbooks gradually muddled together Fisher’s null hypothesis testing with Neyman and Pearson’s decision-based approach. A nuanced debate over how to interpret evidence, with discussion of statistical reasoning and design of experiments, instead became a set of fixed rules for students to follow.

Mainstream scientific research would come to rely on simplistic p-value thresholds and true-or-false decisions about hypotheses. In this rote-learned world, experimental effects were either present or they were not. Medicines either worked or they didn’t. It wouldn’t be until the 1980s that major medical journals finally started breaking free of these habits.

Ironically, much of the shift can be traced back to an idea that Neyman developed in the early 1930s. With economies struggling in the Great Depression, he’d noticed there was growing demand for statistical insights into the lives of populations. Unfortunately, there were limited resources available for governments to study these problems. Politicians wanted results in months—or even weeks—and there wasn’t enough time or money for a comprehensive study. As a result, statisticians had to rely on sampling a small subset of the population. This was an opportunity to develop some new statistical ideas.

Suppose we want to estimate a particular value, like the proportion of the population who have children. If we sample 100 adults at random and none of them are parents, what does this suggest about the country as a whole? We can’t say definitively that nobody has a child, because if we sampled a different group of 100 adults, we might find some parents. We therefore need a way of measuring how confident we should be about our estimate. This is where Neyman’s innovation came in. He showed that we can calculate a “confidence interval” for a sample, which tells us how often we should expect the true population value to lie in a certain range.
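
As a rough illustration of that survey example, the sketch below (my own, using the common Clopper-Pearson construction rather than anything specific to Neyman's paper) turns "zero parents out of 100 adults sampled" into a 95 percent confidence interval.

```python
from scipy.stats import beta

# Hypothetical survey from the example above: 100 adults sampled, 0 are parents.
successes, n = 0, 100

# Clopper-Pearson ("exact") 95% interval for a binomial proportion.
# With zero successes the lower bound is 0; the upper bound comes from a beta quantile.
lower = 0.0 if successes == 0 else beta.ppf(0.025, successes, n - successes + 1)
upper = beta.ppf(0.975, successes + 1, n - successes)

print(f"95% confidence interval: {lower:.1%} to {upper:.1%}")  # roughly 0% to 3.6%
```

Built this way, the interval comes with the guarantee Neyman cared about: across many repeated samples, intervals constructed like this will contain the true population proportion at least 95 percent of the time.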

Confidence intervals can be a slippery concept, given that they require us to interpret tangible real-life data by imagining many other hypothetical samples being collected. Like those type I and type II errors, Neyman’s confidence intervals address an important question, just in a way that often perplexes students and researchers. Despite these conceptual hurdles, there is value in having a measurement that can capture the uncertainty in a study. It’s often tempting—particularly in media and politics—to focus on a single average value. A single value might feel more confident and precise, but that precision is ultimately illusory. In some of our public-facing epidemiological analyses, my colleagues and I have therefore chosen to report only the confidence intervals, to avoid misplaced attention falling on specific values.

Since the 1980s, medical journals have put more focus on confidence intervals than on standalone true-or-false claims. However, habits can be hard to break. The relationship between confidence intervals and p-values hasn’t helped. Suppose our null hypothesis is that a treatment has zero effect. If our estimated 95 percent confidence interval for the effect doesn’t contain zero, then the p-value will be less than 5 percent, and based on Fisher’s approach, we will reject the null hypothesis. As a result, medical papers are often less interested in the uncertainty interval itself than in the values it does—or doesn’t—contain. Medicine might be trying to move beyond Fisher, but the influence of his arbitrary 5 percent cutoff remains.
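
That relationship is easiest to see with the usual normal approximation. In the sketch below the effect estimate and standard error are invented numbers, chosen only to show that the 95 percent interval excludes zero exactly when the two-sided p-value falls below 0.05.

```python
from scipy.stats import norm

# Invented estimate of a treatment effect and its standard error.
estimate, std_error = 1.2, 0.5

# 95% confidence interval and two-sided p-value for "effect = 0",
# both based on the normal approximation.
z_crit = norm.ppf(0.975)  # roughly 1.96
ci_low = estimate - z_crit * std_error
ci_high = estimate + z_crit * std_error
p_value = 2 * norm.sf(abs(estimate / std_error))

print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")  # 0.22 to 2.18
print(f"p-value: {p_value:.3f}")                 # 0.016
# The interval excludes zero precisely when the p-value is below 0.05.
```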

Excerpt adapted from Proof: The Uncertain Science of Certainty, by Adam Kucharski. Published by Profile Books on March 20, 2025, in the UK.