P < 0.05 and Other Lies We Tell Ourselves
The p-value is a golem; stop treating it as anything more
By “we,” I mean mostly myself; out of insecurity I’ve decided to make this general.
Every statistics project I’ve touched, every colleague’s deck I’ve reviewed, the p-value sits at the end like a verdict. Pass or fail. Significant or not. Almost everyone I’ve worked with runs frequentist analyses. Most of them probably don’t know that word. They’re just following the textbook they were handed in school.
I was the same. No stats background out of college, just thrown into analytics. And the p-value made sense to me because it’s clean: no priors, seemingly no assumptions, reproducible by anyone. You get a number and you move on. And that’s the trap that I, and a lot of the data teams I’ve encountered, fell into.
McElreath’s Statistical Rethinking opens Chapter 1 with an image that reframed how I think about all of this. Statistical models are golems: clay robots that follow instructions exactly and have no judgment. They don’t know when the question is wrong. They don’t know when you’re misreading the output. They just compute.
The p-value is a golem. It just computes. The problem is what we think it’s doing.
What a p-value actually says
The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
It doesn’t tell you your hypothesis is true. It doesn’t confirm the effect is real. It answers one question: if the world were dull and nothing were happening, how often would data this weird show up by accident?
Low p-value means your data looks strange in the null world. Something might be going on. It cannot say what, or how much, or whether your specific explanation is right, because many different hypotheses can produce the same result.
The coin: flip it 20 times, get 15 heads. A fair coin produces 15 or more heads roughly 2% of the time. p = 0.02. Suspicious, but not proof the coin is rigged. Grounds for doubt. That’s all.
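You can check that figure directly. A minimal sketch using scipy’s exact binomial tail, with the same numbers as the coin example above:

```python
from scipy.stats import binom

# One-sided p-value: P(15 or more heads in 20 flips of a fair coin).
# sf(k) returns P(X > k), so pass 14 to include 15 itself.
p_value = binom.sf(14, n=20, p=0.5)
print(f"p = {p_value:.4f}")  # ~0.0207, the "roughly 2%" above
```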
| Result | What people think | What it means |
|---|---|---|
| Low p-value | “Real effect. Not a fluke.” | “Data fits poorly in a null world. Something might be up.” |
| High p-value | “No effect. Just noise.” | “The null explains this fine. Can’t rule out ‘nothing’ yet.” |
Where it breaks
P-hacking
Analyst needs a win. Slices the data by age, region, device, time period, runs tests until p < 0.05 lands somewhere. That number goes in the deck.
The p-value assumes you ran one test. At a 0.05 threshold, the chance of at least one false positive across 20 independent tests is 1 − 0.95^20 ≈ 64%, so run 20 and you will likely find significance somewhere even when nothing is real. Every individual test was correct. The analyst broke it by picking the lucky one.
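Here is a minimal simulation of that failure mode. The slice count and sample sizes are made up; both groups are drawn from the same distribution, so every “win” is false by construction:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments = 1_000
n_slices = 20  # age x region x device x time period, etc.
false_wins = 0

for _ in range(n_experiments):
    # Null world: treatment and control are identical distributions.
    p_values = [
        ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
        for _ in range(n_slices)
    ]
    if min(p_values) < 0.05:  # the analyst reports the "lucky" slice
        false_wins += 1

print(f"At least one p < 0.05 in {false_wins / n_experiments:.0%} of experiments")
# ~64%, matching 1 - 0.95**20
```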
“Not significant” read as “no effect”
HR tests a wellness program on 40 people. Retention ticks up slightly. p = 0.15. Program gets cancelled: “data shows it doesn’t work.”
A high p-value is not evidence of nothing. It’s insufficient evidence to conclude something. With 40 people the test had very little power: even a real, modest lift would fail to reach significance most of the time. Those aren’t the same situation.
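A sketch of that power problem, assuming numbers the anecdote doesn’t give: 40 employees split evenly, and a program that truly lifts retention from 70% to 80%:

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_per_group = 20                   # 40 people total, split evenly (assumption)
p_control, p_program = 0.70, 0.80  # assume the program really works

hits = 0
trials = 2_000
for _ in range(trials):
    stayed_t = rng.binomial(n_per_group, p_program)
    stayed_c = rng.binomial(n_per_group, p_control)
    table = [[stayed_t, n_per_group - stayed_t],
             [stayed_c, n_per_group - stayed_c]]
    _, p = fisher_exact(table, alternative="greater")
    hits += p < 0.05

print(f"Power at n=40: {hits / trials:.0%}")
# Roughly 10-20%: the test misses a real 10-point lift most of the
# time, so p = 0.15 says almost nothing about the program.
```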
Significance mistaken for importance
Product team A/B tests button color on 2 million users. Blue wins: p < 0.0001. Months of engineering follow.
The actual difference was 0.003% more clicks. At that scale, nearly everything is significant. The p-value measures how surprising the data would be under the null, not how large the effect is. Effect size tells you whether to care. A tiny p with a tiny effect is still a tiny effect.
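To see the mechanism with illustrative numbers (not the team’s actual data): a hand-rolled two-proportion z-test at a million users per arm turns a 0.13-percentage-point lift into a vanishingly small p:

```python
import numpy as np
from scipy.stats import norm

# Assumed numbers for illustration: 1M users per arm,
# click rates of 5.00% vs 5.13%.
n = 1_000_000
p_a, p_b = 0.0500, 0.0513

# Two-proportion z-test, computed by hand.
pooled = (p_a + p_b) / 2
se = np.sqrt(2 * pooled * (1 - pooled) / n)
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))

print(f"lift = {(p_b - p_a) * 100:.2f} percentage points, p = {p_value:.1e}")
# p ~ 3e-5, yet the lift is 0.13 points. Whether that justifies months
# of engineering is a business question, not a p-value question.
```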
Correlation read as cause
Retail chain finds louder store music correlates with higher spend, statistically significant. Volume goes up everywhere.
Loud music was concentrated in high-income urban stores that were always going to outspend. The model found a pattern. It said nothing about why. Many different causal processes produce the same statistical result, which is exactly the hypotheses → process models → statistical models (H → P → M) problem McElreath spends the chapter on.
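A toy version of that confound, with a hypothetical data-generating process and arbitrary coefficients: neighborhood income drives both volume and spend, volume has zero causal effect, and the correlation is still “significant”:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_stores = 200

# Hypothetical process: income drives BOTH music volume and spend;
# volume has no causal effect on spend at all.
income = rng.normal(size=n_stores)
volume = 0.8 * income + rng.normal(scale=0.5, size=n_stores)
spend = 1.5 * income + rng.normal(scale=0.5, size=n_stores)

r, p = pearsonr(volume, spend)
print(f"volume vs spend: r = {r:.2f}, p = {p:.1e}")
# A strong, highly "significant" correlation from a process
# containing zero causal link between volume and spend.
```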
The actual problem
All four mistakes share one root: treating a model verdict as the answer to a research question.
Testing a null hypothesis is not the goal. p < 0.05 is a gate, not a destination. It means a signal exists worth examining, not a conclusion worth acting on. The questions that matter come after: How large is the effect? What alternative explanations fit the same data? What causal process would actually generate this pattern?
The golem gives you a number. What the number means is your problem.