Correlation Still Matters: Why We Shouldn't Abandon Association in Our Rush to Causation
A critical look at why correlation isn't as misleading as some argue, illustrated with real-life data science examples. (transcribed)
So yeah, I just wanted to get started with something that’s been on my mind lately. You know, as data scientists, we’ve all internalized this mantra: “correlation doesn’t mean causation.” We repeat it so often that we’ve kind of started distrusting correlation altogether. Whenever we do analysis, come up with some metrics, find a negative or positive value—immediately it’s “well, correlation doesn’t mean causation,” as if that makes the number meaningless.
We all say we should be proving causation, but then, honestly, we rarely do. Maybe we try some causal inference stuff with observational data, but once it gets complex, most of us just… leave it. We don’t actually go deep into estimating the causal effect.
But here’s what I’ve been thinking: correlation still matters. And I want to discuss why, and when it’s actually fine to just use correlation, and when we genuinely need to move beyond it. This isn’t going to be a technical tutorial on frameworks—just, if you think in a first principles way, why correlation is valuable, and where it becomes necessary to look for something more.
The Context We’ve Lost
Here’s the thing we often forget. Statisticians and econometricians have been using correlation and regression analysis effectively for decades—like, in social sciences and economics, this is their bread and butter. As Professor Joshua Angrist always says, “econometrics is the original data science.” And that’s totally true.
For example, take something simple like price and sales in a consumer brand—Coca-Cola, Pepsi, whatever. Every week they change the price, and you have data for sales that week. You compute correlation between price and sales. Let’s say it’s -0.45—negative. You interpret this as, “okay, when price goes up, sales tend to go down.” Clear, right?
Then someone pipes up, “but correlation doesn’t mean causation!” and suddenly the whole thing is in doubt. Like, are we supposed to just throw away this number?
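For concreteness, the number being doubted is just a Pearson coefficient. A few lines of numpy reproduce it on made-up weekly data (the prices and unit sales below are illustrative, not real figures from any brand):

```python
import numpy as np

# Hypothetical weekly price points and unit sales (illustrative numbers only)
price = np.array([1.89, 1.99, 1.99, 2.19, 2.29, 2.39, 2.49, 2.59])
sales = np.array([1300, 1200, 1180, 1100, 1050, 980, 950, 900])

# Pearson correlation: the covariance of the two series scaled by both
# standard deviations, so it always lands in [-1, 1]
r = np.corrcoef(price, sales)[0, 1]
print(round(r, 2))  # negative: weeks with higher prices see lower sales
```

Real price-and-sales data is much noisier, which is why a coefficient like -0.45 is more typical than the near-perfect value this clean toy series produces.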
See, there are two fundamentally different questions at play here:
- What relationship can we observe in the data? (correlation)
- What happens when we intervene, when we change something? (causation)
And honestly, sometimes we only need the answer to the first question.
When Correlation Is Actually Enough
Correlation sits at the first rung of Judea Pearl’s causal ladder—association. It answers a surface-level question: what patterns do we see? And sometimes that’s exactly all we need.
Think about customer data. You find a strong correlation between household income and average purchase value. Now, you can’t intervene and tell someone “hey, earn less so I can check what changes in your purchasing behavior,” right? That’s neither possible nor ethical.
And yes, we have natural experiments now—this is what won Card, Angrist, and Imbens the 2021 Nobel Prize. Card's famous minimum wage study (with Alan Krueger) compared fast-food employment after New Jersey raised its minimum wage while neighboring Pennsylvania didn't. Angrist used the Vietnam draft lottery as essentially random assignment to military service. These are brilliant ways to estimate causal effects when you can't experiment directly.
But here’s the thing: these natural experiments answer a different question than correlation.
Correlation tells you: “high-income customers tend to spend more.”
Natural experiments tell you: “when income rises due to an external shock, purchases increase by X amount.”
The first helps with customer segmentation, inventory planning, store placement. The second helps predict what happens if there’s economic growth or policy changes. Both valuable! Both needed! But for different purposes.
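The gap between those two kinds of answers can be made concrete. The minimum-wage design above boils down to a difference-in-differences calculation; the employment averages below are invented for illustration, not the published estimates:

```python
# Difference-in-differences sketch of the minimum-wage natural experiment.
# All four averages are made up for illustration.
nj_before, nj_after = 20.4, 21.0   # "treated" state: raised its minimum wage
pa_before, pa_after = 23.3, 21.2   # "control" state: no wage change

# Subtracting the control state's change nets out shocks that hit both
# states (seasonality, a regional downturn), isolating the policy's effect
did = (nj_after - nj_before) - (pa_after - pa_before)
print(round(did, 1))  # positive: employment rose relative to the counterfactual
```

Note what the single subtraction buys you: a correlation between wages and employment can't separate the policy from everything else that happened that year, but the control state soaks up the "everything else."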
Sometimes—often, actually—you just need the correlation.
The Metaphysical and Epistemological Side of Causation
Okay, this is where things get deeper, almost philosophical. And I think this is the part we don’t talk about enough.
Causation isn’t just a technical upgrade from correlation—it’s a metaphysical claim about the architecture of reality itself. When I say “A causes B,” I’m not just noting that they occur together. I’m claiming there’s a mechanism, a story, a chain of events connecting them. Correlation doesn’t need this. It just observes. But causation? Causation demands we understand—or at least believe in—the pathway from cause to effect.
Physical Mechanisms
Take COVID transmission in movie theaters. When we say “going to theaters increases COVID risk,” we’re describing a physical mechanism: infected person exhales → respiratory droplets with virus particles become airborne → droplets circulate through theater ventilation → others inhale these particles → virus enters respiratory system → infection occurs.
This is physics. Fluid dynamics. Biology. You can model airflow, measure viral loads, calculate transmission probabilities based on distance and time. The mechanism is tangible, measurable, traceable. When we shut theaters, we’re physically breaking this chain.
Behavioral and Social Mechanisms: The “Metaphysical” Layer
But now consider something like a government cash transfer program increasing women’s financial autonomy. The mechanism here isn’t droplets in air—it’s behavioral, psychological, social. Almost metaphysical in the philosophical sense, because it exists in the realm of human consciousness and social structures rather than pure physics.
Take India’s direct benefit transfers to women. The causal chain might look like: cash transfer → goes to woman’s bank account (not husband’s) → increased control over financial decisions → ability to save without permission → gradual savings accumulation → investment in education or microenterprise → improved household welfare.
Every arrow here is a jump across psychology, social norms, power dynamics. Economists have documented these pathways, but they’re not “physical” like virus transmission. They operate through beliefs, behaviors, social structures—things that exist in human minds and cultures.
Why This Distinction Matters So Much
For physical mechanisms:
- You can model them mathematically with precision
- Lab experiments can isolate the mechanism
- The mechanism is stable across contexts (gravity works the same everywhere)
- Strong correlation + known physical mechanism = pretty good evidence of causation
For behavioral/social mechanisms:
- Context is everything (what works in Kerala might fail in Bihar)
- Multiple pathways might exist simultaneously
- The mechanism itself evolves over time as culture shifts
- Correlation is much weaker evidence because of confounding social factors
This is why we trust “smoking causes lung cancer”—we’ve mapped the carcinogens, the DNA damage, the cellular mutations. But “minimum wage causes unemployment”? That’s behavioral, contextual, debatable even with the best natural experiments.
The Epistemological Challenge: How Do We Know?
And this brings us to epistemology—how do we know what we claim to know?
Correlation is empirical, observable. We can compute it, test its statistical significance, replicate it. But causation, especially for behavioral mechanisms, requires us to believe in invisible chains of events. Even with RCTs, we’re inferring that our intervention caused the outcome through some mechanism we hypothesize but can’t always directly observe.
This isn’t a weakness—it’s just the nature of knowledge. But we should be honest about it. When we move from correlation to causation, we’re not just upgrading our statistics. We’re making philosophical commitments about how the world works.
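That empirical footing is easy to demonstrate: a correlation's significance can be checked with nothing but resampling. The sketch below runs a simple permutation test on simulated data (the variables and effect size are invented):

```python
import numpy as np

# Simulated data with a built-in positive association
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r_obs = np.corrcoef(x, y)[0, 1]

# Permutation test: shuffling y destroys any real relationship, so the
# shuffled coefficients show what "pure chance" correlation looks like
perm_rs = [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(2000)]
p_value = np.mean(np.abs(perm_rs) >= abs(r_obs))
print(round(r_obs, 2), p_value)
```

The point is epistemological: every step here is observable and replicable. No comparable mechanical check exists for the invisible causal chain itself.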
Where Correlation Falls Short (And It’s Not Correlation’s Fault)
The classic stupid examples—ice cream sales correlating with shark attacks, or comparing foot traffic in an Indian store with sales in a US store—these aren’t failures of correlation. They’re failures of thinking. Poor hypothesis formation, terrible experimental design, asking correlation to do a job it was never meant for.
The real limitations show up when we need to make decisions about interventions. Should we raise the minimum wage? Close the theaters? Launch this marketing campaign? These questions aren’t asking “what patterns exist?” but “what will happen if we act?” That’s causation’s domain.
A Practical Framework for Thinking About This
So when should you use what?
Use correlation when you need to:
- Explore and understand your data
- Detect patterns worth investigating
- Segment customers or markets
- Generate hypotheses
- Describe the world as it is
- Make predictions when patterns are stable
Move to causal analysis when you need to:
- Evaluate interventions (did our campaign work?)
- Make policy decisions
- Understand mechanisms (why does this happen?)
- Predict effects of changes you’ll make
- Justify actions with evidence
The sophistication ladder looks like:
- Start with correlation to map patterns
- Use domain knowledge to hypothesize mechanisms
- Look for natural experiments that create quasi-random variation
- Design real experiments where possible and ethical
- Apply causal methods (instrumental variables, regression discontinuity, difference-in-differences, synthetic controls)
- Always question: what assumptions am I making?
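As a taste of the causal-methods rung, here is a toy instrumental-variables estimate in its simplest (Wald ratio) form. Everything is simulated: z is a valid instrument by construction, which is exactly the assumption the last step tells you to question in real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                      # instrument: moves x, not y directly
u = rng.normal(size=n)                      # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal effect of x on y is 2.0

# Naive OLS slope absorbs the confounder's influence and overshoots
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Wald/IV estimator: only the z-driven variation in x is used, so the
# confounder u cancels and the causal effect is recovered
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(round(ols, 2), round(iv, 2))  # OLS is biased upward; IV lands near 2.0
```

The simulation makes the earlier point about epistemology visible: the IV number is only "causal" because we built z to satisfy assumptions that, outside a simulation, are beliefs rather than observations.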
The Journey Never Really Ends
Look, I’ve barely scratched the surface here. There’s a whole universe of methods I haven’t touched. The potential outcomes framework alone has propensity score matching, doubly robust estimation, synthetic controls. Pearl’s causal hierarchy gives us DAGs, d-separation, backdoor and frontdoor adjustments. Then there’s mediation analysis, heterogeneous treatment effects, time-varying treatments…
The complications never stop. What about interference between units? What if your treatment has multiple versions? What if you don’t even know the causal structure? Each of these has spawned entire research fields.
But I hope this is a good starting point for thinking differently about correlation and causation. They’re not competitors—they’re tools for different jobs. Correlation isn’t a consolation prize when you can’t prove causation. It’s the right tool when you need to understand patterns, and the wrong tool when you need to predict interventions.
The journey from correlation to causation isn’t always necessary, isn’t always possible, and when it is both, it’s rarely straightforward. But understanding when you need which tool—and being honest about what each can and cannot tell you—that’s where real data science happens.
A Final Thought
Next time someone dismisses your analysis with “correlation doesn’t imply causation,” don’t just nod along. Ask them: “Did I claim it did? And for this specific question, do we actually need causation?”
Sometimes mapping the world as it is—that’s the whole point. Sometimes we need to dig deeper into the why. Both roads matter. Both are worth understanding. And honestly, being philosophical about it—thinking about the metaphysical claims we make, the epistemological standards we hold—that makes us better data scientists.
Because in the end, whether we’re computing correlations or estimating causal effects, we’re trying to understand reality. And reality, it turns out, is both simpler and more complex than our methods suggest.
Keep questioning. Keep learning. And always, always ask: “What question am I really trying to answer here?”
PS: If you found this interesting and want to dive deeper, start with Pearl’s “Book of Why” for the philosophical side, or Angrist and Pischke’s “Mostly Harmless Econometrics” for the practical side. Both changed how I think about these problems.