A few weeks ago, GDP’s Bill Grosso and Julian Runge wrote an article about the potential pitfalls of observational causal inference modeling—Combating Misinformation in Business Analytics: Experiment, Calibrate, Validate—with a particular focus on Media Mix Modeling. The article originally appeared as a guest post on Eric Seufert’s Mobile Dev Memo and has also been reposted on the GDP blog. It sparked a wide variety of comments on LinkedIn (and an article on AdExchanger), and we decided to collect the community response.
But before we get to the responses, let’s quickly summarize the background for writing the article in the first place.
Evidence increasingly reveals that observational causal inference (OCI)—methods that infer cause-and-effect relationships from existing data—often leads to misjudgments about the impact of business strategies. Unlike randomized controlled trials (RCTs), OCI relies on naturally occurring data patterns, which are susceptible to biases and unobserved variables. These inaccuracies can result in flawed conclusions about business effectiveness, risking wasted resources and harm to market position. Accurate insights are vital for guiding investments, pricing, and marketing strategies, making rigorous experimental validation essential in contexts where causality drives financial and strategic outcomes.
Bill and Julian discuss the limitations of OCI in business analytics, citing methods like Media and Marketing Mix Modeling (m/MMM), which often misattribute causality due to issues like endogeneity and omitted variable bias. They advocate for prioritizing experimental approaches, such as A/B tests and RCTs, to establish causal clarity. Additionally, they recommend using experimental results to calibrate observational models, correcting biases and improving accuracy. By integrating experimentation with model calibration, businesses can enhance analytics reliability and make better-informed decisions.

Two Significant Recent Papers that Prompted the Original Article
The pace of academic articles about possible issues with observational causal inference has increased in recent years. In particular, Julian and Bill cited two recent papers. The first paper was Observational Price Variation in Scanner Data Cannot Reproduce Experimental Price Elasticities, by Robert Bray, Robert Evan Sanders, and Ioannis Stamatopoulos.
The authors analyzed 389,890 randomized in-store supermarket prices across 409 products in 82 test stores and found that experimental price elasticity averaged -0.34, while observational data from 34 control stores suggested an elasticity of about -2.0. This highlights a significant mismatch between observational and experimental estimates of demand elasticity. Observational data suggest that retailer prices are in the elastic range, whereas experimental results indicate pricing in the inelastic range. This discrepancy cannot be attributed to typical factors like estimator properties, price variation processes, or elasticity timeframes. The findings challenge the reliability of observational demand elasticity estimates and raise questions about standard economic models’ applicability to retail pricing.
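The mechanism behind such a gap is easy to reproduce in simulation. The sketch below is purely illustrative (all numbers are synthetic, not taken from the paper): when retailers cut prices precisely when demand is already high—promotions timed to holidays, say—a naive log-log regression on observational data recovers an elasticity far more negative than the true one, while randomized prices recover it correctly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
demand_shock = rng.normal(size=n)  # unobserved demand driver (e.g. holidays)

# Observational world: retailers cut prices when demand is already high
# (promotions), so log-price is negatively correlated with the demand shock.
log_price_obs = -0.5 * demand_shock + 0.5 * rng.normal(size=n)
TRUE_ELASTICITY = -0.34  # roughly the experimental average reported by Bray et al.
log_qty_obs = TRUE_ELASTICITY * log_price_obs + demand_shock + 0.3 * rng.normal(size=n)

# Experimental world: prices are randomized, independent of the shock.
log_price_exp = 0.5 * rng.normal(size=n)
log_qty_exp = TRUE_ELASTICITY * log_price_exp + demand_shock + 0.3 * rng.normal(size=n)

def ols_slope(x, y):
    """Simple one-regressor OLS slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"observational elasticity: {ols_slope(log_price_obs, log_qty_obs):.2f}")  # strongly biased
print(f"experimental elasticity:  {ols_slope(log_price_exp, log_qty_exp):.2f}")  # near -0.34
```

With these (assumed) parameters the observational estimate lands near -1.3 while the experimental one recovers -0.34: the same qualitative pattern as the paper, produced by nothing more than price endogeneity.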
Julian and Bill also cited Close Enough? A Large-Scale Exploration of Non-Experimental Approaches to Advertising Measurement, by Brett R. Gordon, Robert Moakler, and Florian Zettelmeyer, in which the authors evaluate the accuracy of non-experimental methods in estimating the causal effects of digital advertising. Utilizing data from 15 Facebook advertising experiments across 11 brands, the researchers compare experimental results with those derived from observational models, including matching, inverse probability weighting, and regression. The findings reveal that these non-experimental approaches often produce biased estimates, with the direction and magnitude of bias varying across brands and methods. This variability underscores the challenges of relying on observational data for advertising measurement.
Concurring Commentary from AdExchanger
James Hercher, in his article Learning To Love And Let Go Of Attribution Models on AdExchanger, makes a strong case for our article, saying that while mix models and other attribution approaches have been essential tools for understanding marketing impact, they often fall short in today’s complex media landscape.
Hercher highlights the limitations of models like media mix modeling (MMM) and multi-touch attribution (MTA), which can misattribute causality due to factors such as endogeneity and unobserved variables, and he warns about the platform incentives behind emerging MMM tooling, echoing the concerns we raise in our article:
There are other reasons not to trust the MMM trend as a ‘truthier’ attribution fallback, now that multitouch attribution and user-level tracking are infeasible.
And that’s because MMM might just become another walled garden platform plaything.
Earlier this year, Google open-sourced its own MMM product, which it calls Meridian. Meta has an open-source MMM solution it calls Robyn, while Amazon’s is still a proprietary product, not open-source.
But platform MMM is the same as platform anything. It’s there to prove the platform succeeded, as much as that your marketing worked. Google’s Meridian, for example, is really good at tying together search, YouTube, TV and Google Ads campaigns.
Hercher argues that these traditional models, while helpful in a simpler media environment, are now less effective at navigating the fragmented, multichannel advertising landscape. Instead, he supports our emphasis on integrating experimentation and validation, underscoring that models should not be relied upon as standalone truths. He agrees that mixing empirical experimentation, such as A/B tests, with calibration of observational data allows marketers to correct for biases and improve attribution accuracy.
Hercher’s view aligns with our stance that business analytics must move beyond traditional attribution models and embrace an iterative, hybrid approach to better capture the causal effects of marketing strategies and guide informed decision-making.

Discussion in the Community
The article also resonated with many of our peers in the community, generating a variety of thoughtful comments and discussions after Bill and Julian posted their article on LinkedIn.
The team at Haus, a startup marketing science platform that helps companies measure the incremental ROI of online and offline ad spend, had a lot to say about our take on MMM and causal analysis.
Zach Epstein, Founder and CEO at Haus, agrees in principle and notes that experiments are hard:
This is a great article that covers a lot of the issues we see day in and day out. What I think is less appreciated, especially in the world of advertising, is how difficult it is to run great experiments. The concept of running experiments alone won’t solve this problem – there’s a tremendous need for increasing access to world class methods and infrastructure.
Running an experiment is easy. Running an experiment that you’d bet your own money on is extremely difficult.
Chandler Dutton, who works on Customer Success at Haus, also agrees in principle and points out that, in practice, results do not match models:
This piece thoughtfully explains not just why I’ve been insistent on every brand I know needing to work with a partner like Haus, but why I joined the team.
Too often, marketing teams are chasing outputs from observational modeling only and trying to use those to inform multi-million dollar decisions without those models being able to prove causality. Whether for its own sake or for the purpose of calibrating such observational models, experimentation is critical and the only path towards getting really actionable data.
I’ve seen teams chase their modeled attribution results and not have their resulting investments drive the results the model pointed towards. I’ve also seen teams drive incredible success without really knowing why and what to do next to pour gasoline on the fire. The missing piece is experimentation. I’d recommend all of the growth marketers in my network read this one.
Olivia Kory, who works on Incrementality Testing at Haus, also agrees we need experiments:
Dr. Julian Runge and William Grosso just released a very important guest essay in Eric Seufert’s Mobile Dev Memo about the shortcomings of observational causal inference modeling, specifically MMM. In their words:
“Evidence is mounting that observational causal inference (aka MMM) often misinforms about the actual impact of business strategies and actions, and this means we need more experimentation — for baseline evaluation of policies, for validation of observational insights, and for calibration of observational models.
“When a new drug is tested, RCTs are the gold standard because they eliminate bias and confounding, ensuring that any observed effect is truly caused by the treatment. No one would trust observational data alone to conclude that a new medication is safe and effective. So why should businesses trust OCI techniques when millions of dollars are at stake in digital marketing or product design?”

Others in the community also reacted strongly to the article.
Tony Williams, an Economist and Director of Data Science at FlowPlay, agrees and wonders if a greater focus on advanced data science techniques (matching methods, among others) might help:
Definitely excited to read this since I’ve spent the last few days looking at valid matching methods for multiple variants and have been surprised that the academic literature hasn’t covered this more. I know you’re looking at something different (MMM), but as someone who loves experimentation, there are also times we need other methods.
Very cool to see this discussion getting brought up!
Jim Kingsbury, an E-Commerce Marketing Advisor who has worked with Zappos, Allbirds, KiwiCo, and Amazon, agrees and lists vendors who routinely run A/B tests to verify models:
Running geo-lift tests to validate – or, if needed, calibrate – the output of an MMM is becoming table stakes.
This essay is a great reminder of how important this is.
To marketing leaders out there who are using or evaluating MMM solutions, I recommend asking the vendor about their process to validate what their model claims.
If the vendor hems & haws in response to this question, I’d recommend finding other vendors who enthusiastically embrace this critical step.
A few vendors I know who always do this include:
- SegmentStream
- WorkMagic
- LiftLab
I’m sure there are others who do this and I’m excited for anyone reading this post to share who they are.
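For readers unfamiliar with the geo-lift tests Kingsbury mentions, the core estimator is a simple difference-in-differences across matched geographic markets. The sketch below uses entirely synthetic numbers (not from any vendor) to show the arithmetic: the control geos’ background trend is subtracted from the treatment geos’ change, isolating the campaign’s incremental effect.

```python
import numpy as np

rng = np.random.default_rng(1)
days = 28

# Illustrative daily conversions in matched geo groups (synthetic data).
# Treatment geos receive the campaign in the "post" period; control geos don't.
ctrl_pre   = 100 + rng.normal(0, 5, days)
ctrl_post  = 102 + rng.normal(0, 5, days)      # background trend: +2/day
treat_pre  = 100 + rng.normal(0, 5, days)
treat_post = 102 + 8 + rng.normal(0, 5, days)  # trend +2, true campaign lift +8

# Difference-in-differences: subtract the control trend from the treatment change.
lift = (treat_post.mean() - treat_pre.mean()) - (ctrl_post.mean() - ctrl_pre.mean())
print(f"estimated incremental lift: {lift:.1f} conversions/day (true: 8)")
```

The estimate lands near the true +8 because the control geos absorb whatever trend would otherwise be misattributed to the campaign; this is the number one would compare against the MMM’s claimed contribution for that channel.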
A longer back-and-forth unfolded between Jimmy Marsanico, VP of Product at Prescient AI, and Toma Gulea, Lead Data Scientist at Polar Analytics, debating whether comparing marketing measurement to drug trials makes sense. Jimmy opened:
I appreciate the focus on measurement rigor, but the paper’s comparison of marketing measurement to drug trials misses some crucial real-world complexity. In clinical trials, control groups get placebos in isolation. But in digital advertising, ‘control’ users are actively shown alternative ads competing for the same share of wallet. When a control user purchases a competitor’s product after seeing their ad instead of yours, they’re naturally less likely to buy your product – not because your ad wouldn’t have worked, but because they already spent their budget elsewhere. This makes the ‘untreated’ state anything but neutral, potentially leading experiments to undercount true advertising impact.
While empirical tools like incrementality tests provide valuable data, treating any single approach as ‘table stakes’ oversimplifies the challenge. The most successful brands recognize that marketing measurement is more art than perfect science – they triangulate insights from multiple sources (sometimes by leveraging dynamic and regularly updated MMMs, like that of Prescient AI) and combine them with strategic thinking and domain expertise.
After all, isn’t the goal to make better decisions, not just chase methodological purity?
Toma replied:
Jimmy M. But that’s actually what you want to test. For example if you were to cut your spending overall on a channel, what you describe would happen (consumers shifting to competitors) and that’s exactly these external factors you want to account for when evaluating the true impact of your ads. Am I wrong?
Jimmy clarified his take on the differences between marketing and drug trials:
Toma Gulea 🤔 I’m not quite sure we’re saying different things here. The comparison and example of pharmaceuticals is just inherently different in the approach of test/control groups because during a hold out test, your competitors aren’t holding out (but during a pharmaceutical test you’re not taking a competitor’s drug to treat your symptoms).
Alluding to your other comment below, you’re right — if you’re going to cut spend on a channel (or double it perhaps), if you’ve only spent the same amount daily on that channel, or campaign, there’s less data (or confidence) in the relationship of spend to revenue at any other spend value — making it a model to help make decisions, not a perfect crystal ball to predict impact of ALL future changes. Thus, my argument against “table stakes” — there is value… but only when used appropriately — that applies to any measurement tool.
And Toma concluded that there are more parallels than differences:
The principle is the same: isolating the effect of a treatment (or ad exposure) to estimate causal impact. Your competitor’s ad becoming more effective as a result of a change in ad exposure is absolutely part of the causal impact you want to test. The only difference with Pharma is the elimination of the “Placebo effect”.
Even in pharma, participants in control groups don’t exist in a vacuum. Consider a holdout test for a new pain medication. Some participants in the control group might be taking over-the-counter pain relievers during the trial, while the treatment participants are not. The intervention still caused participants to stop the alternative medication, leading to a better or worse outcome.
The only thing an RCT can give you is the impact of the intervention on the outcome in the real world, not the mechanism.
If your change in ad spend causes a drop in revenue because of competition, then so be it—that’s the real-world outcome you’ll get, just like an outcome for a medication is influenced by the use of an alternative medicine.
Randomized controlled trials are often impractical, infeasible, or too costly, and other methods should be employed. But an RCT is an RCT, and the comparison with pharma is absolutely correct.
Separately, Toma Gulea made an interesting general observation about the differences between claims and the actual value of using MMMs.
A typical claim from MMM vendors: “testing the model’s accuracy on a separate holdout period ensures its trustworthiness”.
This misses the core purpose of an MMM. The real goal isn’t simply to predict revenue based on past marketing spend, but to uncover the causal relationship between channel spending and revenue outcomes. The key question is: ‘What would happen if I spent X?’ In situations where the marketing spend has been stable over time, evaluating accuracy on historical data is meaningless because it doesn’t assess how the model will perform when actual changes occur. When it does, your MMM will break and you will realize it’s useless.
The right approach requires a causal lens:
- Start by understanding the business and marketing strategy to identify confounders and latent variables.
- Then, apply causal methods and gather control and instrumental variables.
- Avoid the lure of “predictive accuracy”: you can’t observe the true relationship you are trying to model. The goal is to have a useful model!
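Toma’s point about holdout accuracy can be made concrete with a small simulation (all data synthetic, parameters assumed): when spend is stable and tracks an unobserved seasonal driver, a spend-only regression predicts held-out revenue almost perfectly while estimating a causal effect of spend roughly twenty times too large—exactly the model that “validates” on a holdout period and then breaks the moment spend actually changes.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(104)                    # two years of weekly data
season = np.sin(2 * np.pi * t / 52)   # hidden seasonal demand driver

# Spend is "stable": it tracks seasonality with little independent variation.
spend = 10 + 2 * season + 0.2 * rng.normal(size=t.size)
TRUE_EFFECT = 0.5  # assumed true incremental revenue per unit of spend
revenue = 100 + TRUE_EFFECT * spend + 20 * season + rng.normal(size=t.size)

# Fit a spend-only model on year 1, then check predictive accuracy on year 2.
train, test = slice(0, 52), slice(52, 104)
b, a = np.polyfit(spend[train], revenue[train], 1)
pred = a + b * spend[test]
ss_res = np.sum((revenue[test] - pred) ** 2)
ss_tot = np.sum((revenue[test] - revenue[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"holdout R^2: {r2:.2f}")                             # high: the model "validates"
print(f"estimated spend effect: {b:.1f} vs true {TRUE_EFFECT}")  # wildly biased
```

The model scores a holdout R² above 0.9 because spend is a near-perfect proxy for seasonality, yet acting on its spend coefficient—“what would happen if I spent X?”—would overstate the incremental return by an order of magnitude.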
Kenneth Wilbur, Professor of Marketing and Analytics at the University of California, San Diego – Rady School of Management, made the interesting point that experimentation was viewed as important in the early papers but somehow dropped out of daily practice:
Some of the original MMM literature in the 1950s pointed out that MMMs obviously needed to be calibrated with experimental variation in spending.
An Operations-Research Study of Sales Response to Advertising, by M. L. Vidale and H. B. Wolfe (1957), demonstrates the necessity of precise, reproducible data to evaluate advertising effectiveness. Through controlled experiments, the authors identified key parameters—Sales Decay Constant, Saturation Level, and Response Constant—that define sales responses to advertising campaigns. These parameters enable the development of predictive mathematical models to optimize advertising efforts and budget allocations. The study emphasizes that well-designed experiments provide actionable insights for tailoring strategies to maximize return on investment, underlining the critical role of empirical data in refining marketing decisions.
A Media Planning Calculus, by John D. C. Little of MIT and Leonard M. Lodish of the University of Pennsylvania (1969), emphasizes the importance of experimentation in developing an effective marketing or MMM strategy by introducing a structured approach to media planning, known as the Media Planning Calculus, and recommending experimental calibration wherever feasible. The authors advocate for controlled experiments and computational modeling to measure and predict market responses to advertising. By integrating concepts like exposure frequency, forgetting, audience segmentation, and diminishing returns, the study demonstrates how experimentation refines parameter estimations, such as exposure values and response functions. This empirical grounding allows for dynamic optimization of advertising schedules and budgets, significantly improving marketing efficiency.
It is awesome to be reminded of these papers that already called out the necessity for experimental calibration of OCI and m/MMM almost 60 years ago.
We are very excited by the positive reception of our analytics strategy opinion piece. GDP is committed to precise analytics and driving forward best practices in gaming and beyond.



