Marketing Measurement Has a Measurement Problem

Apr 19, 2026 · 11 min read

In this essay, I want to convince you that, unless steps are taken, the future of MMM as a robust and commercially viable area for analytical decision-making products is in jeopardy. If the category continues on its current path, MMM risks joining the dust heap of analytics history alongside other once-promising technologies whose commercial claims raced ahead of any practical way to evaluate them before purchase.

That’s a bold claim, and bold claims require both disclaimers and evidence. My disclaimers are simple: I run a boutique analytics consultancy. I am not selling an MMM product, and I have no intention of entering the space. While many of my clients run advertising and user-acquisition campaigns for which MMM could be useful (indeed, is useful), the market is already crowded with firms that specialize in building and implementing these systems.

The reason I believe the claim is straightforward. Over the last year I’ve heard versions of the same concern five different times from CMOs and VPs of Growth:

  • “Our MMM and our MMP tell us different things. We’re not sure which to trust.”
  • “We have an MMM. But we’re not sure it’s any good. Can you help us understand what it would take to improve it, and help us evaluate vendors?”
  • “There are a lot of MMMs out there. I have no idea which ones are good.”
  • “We have an internally built MMM and it’s useless. I showed the source code to someone I trust and he said it’s a good implementation. But the output doesn’t make any sense, so I guess MMM doesn’t work in our vertical?”
  • “We do a lot of OOH advertising, where we don’t get solid impression data. MMM won’t work for us.”

Now, of course, any MMM vendor has answers for each of these, and those answers are usually well rehearsed. But those answers also tend not to address the core question beneath all five statements: how can a marketing decision-maker know, before expending the time and energy to integrate an MMM system, whether that particular MMM system is likely to drive value for them?

Why Does Knowing in Advance Matter?

Why does this matter? Why does a marketing decision-maker need to know, before integrating, whether they are making a good choice? Because the underlying technology is complex, the implementation process (including the integration) is costly, and, in the current marketplace, evaluating a single vendor takes three to six months once you include calibration and geo-testing.

In more detail:

  • There are a lot of MMMs on the market today. Marketing Science Today’s 2026 MeasureMap lists more than 65 vendors. It’s hard to imagine someone new to MMM successfully choosing between 65 options without guidance.
  • MMM vendors often claim some level of unique knowledge or secret sauce. Lifesight, for example, has a vendor-comparison page claiming that its use of causal mediation analysis and halo-effect modeling is unique and differentiated (see figure 1). At the same time, Prescient AI’s marketing materials explicitly emphasize halo effects. Without access to the formal mathematical models, or at least to clear and comparable disclosures, it is hard to tell whether Lifesight and Prescient mean different things by “halos,” and which implementation is better for particular use cases.
  • Even when you have access to the formal mathematical models, it’s often not obvious whether a particular feature is necessary, how to use it, or whether other vendors have something similar. Consider, for example, the “spikes” feature in Recast’s MMM. Recast does an excellent job of explaining the details (they published YouTube videos, they documented the model, and they published some tips for how to use spikes in practice). My experience is that about half the vendors I’ve talked to claim to have a similar feature, often with slightly different mathematical models. It’s clear that if I were running an e-commerce site, I’d want to use spikes to model events like Black Friday (a minimal sketch of the underlying idea appears after this list). But what if I were running user acquisition for a mobile game? Would spikes still be useful? Similarly, Robyn uses ridge regression and multi-objective hyperparameter optimization. To the best of my knowledge, only Robyn supports multi-objective optimization, but is that a compelling differentiator?
  • The process of integrating and tuning an MMM is complex and time-consuming. It requires upstream work to collect, clean, reconcile, and govern the underlying data; repeated iteration across model specification, calibration, validation, and stakeholder review; and ongoing collaboration among analytics, marketing, and data-engineering teams (see, for example, the following three papers for a full discussion of the process: How CMOs Can Get—and Keep—Their Marketing Mix Right, Making Data-Driven Marketing Decisions, and Challenges and Opportunities in Media Mix Modeling).
  • Even once an MMM is fully integrated, geo-testing or other experiments are often necessary to calibrate it properly (see, for example, this discussion by Google, Runge et al.’s discussion of how to calibrate Robyn, or this recent video from PyMC for an overview of the calibration process). Calibration introduces very real costs in time, effort, and foregone certainty while the organization works out whether the model is trustworthy.
Figure 1. Lifesight claims that their causal mediation analysis and support
for halos are unique. Taken from Lifesight’s vendor-comparison documentation.
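
To make the “spikes” idea concrete, here is a minimal sketch of one generic way a regression-based MMM can handle a one-off event like Black Friday: add an indicator-style regressor (here with a small geometric decay) for the event so the media coefficients are not forced to absorb the surge. This is my own toy illustration, not Recast’s implementation; the decay choice, the coefficients, and the simulated data are all assumptions.

```python
import numpy as np

def event_spike(n_weeks: int, event_week: int, decay: float = 0.5) -> np.ndarray:
    """Indicator-style regressor for a one-off event (e.g. Black Friday):
    1.0 in the event week, decaying geometrically afterwards."""
    spike = np.zeros(n_weeks)
    for t in range(event_week, n_weeks):
        spike[t] = decay ** (t - event_week)
    return spike

# Toy data: two years of weekly sales, one media channel, one Black Friday.
rng = np.random.default_rng(0)
n = 104
media = rng.gamma(shape=2.0, scale=50.0, size=n)   # weekly media spend
media[47] *= 3.0                                   # spend is also ramped up for the event
bf = event_spike(n, event_week=47)
sales = 1_000 + 3.2 * media + 4_000 * bf + rng.normal(0, 150, n)

# Fit ordinary least squares with and without the spike regressor.
X_with = np.column_stack([np.ones(n), media, bf])
X_without = np.column_stack([np.ones(n), media])
beta_with, *_ = np.linalg.lstsq(X_with, sales, rcond=None)
beta_without, *_ = np.linalg.lstsq(X_without, sales, rcond=None)

# Without the spike, part of the event lift gets credited to media spend.
print("media coefficient, spike modelled:", round(beta_with[1], 2))
print("media coefficient, spike omitted: ", round(beta_without[1], 2))
```

The open question in the bullet above (whether a feature like this matters for, say, a mobile game with no Black Friday equivalent) is exactly the kind of thing a shared benchmark could answer.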

All of these combine to make choosing an MMM vendor a daunting task. Marketing decision-makers often feel, perhaps correctly, that Danhausen has cursed them.

Case Studies Are Not Sufficient Evidence

MMM companies usually deal with these problems by providing customer testimonials and case studies. That’s a good starting point, but it’s not nearly enough.

For a case study to be persuasive to a marketing decision-maker, the company being studied has to be meaningfully similar to the company making the decision. Otherwise, for a decision-maker, the case study amounts to this: a handpicked customer that is not much like you is willing to publicly say the vendor’s tool helped them. That’s not nothing, but it is much weaker evidence than vendors pretend it is.

It gets worse. Using Claude Code with Opus 4.7 set to “xhigh”, I reviewed 398 distinct case studies from 58 vendors. The methodology was simple: I pointed Claude at Marketing Science Today’s 2026 MeasureMap and retrieved up to 10 case studies from each vendor (the most recent 10 for companies that had more than 10 case studies). Claude was able to find case studies for 58 vendors; 9 additional vendors had no case studies at all that Claude could find.

(The dataset is available upon request but, really, I let Claude do the work. The idea was to use available tooling and public information and see what is easy to find.)

Across those case studies, Claude found 87 distinct outcome metrics. Most of the metrics were business outcomes, not traditional measures of model accuracy: only three vendors mentioned MAPE, only one mentioned R^2, only four mentioned statistical significance, and none mentioned WMAPE, sMAPE, RMSE, MAE, MASE, AIC, BIC, …

Figure 2. Prompt used with Opus 4.7 to map the metrics cited in vendor case studies.

In the immortal words of the AI-slop vendors, let that sink in. Across the board, MMM vendors are producing case studies that usually do not discuss model accuracy in traditional data science terms. The striking exception that seems to prove the rule is Stella. In articles such as “Measuring the incrementality of podcast ads: a 6-month case study,” Stella breaks out MAPE by month, uses a 90 percent confidence threshold, discusses R^2, and so on.
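
For readers who do not live in forecasting-metric land, here is a minimal sketch of what a few of those accuracy measures actually are, computed on a holdout window. The numbers are invented purely for illustration; nothing here is specific to any vendor.

```python
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error (assumes actuals are non-zero)."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

def smape(actual, pred):
    """Symmetric MAPE: bounded and less sensitive to small actuals."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(2 * np.abs(pred - actual) / (np.abs(actual) + np.abs(pred))) * 100)

def rmse(actual, pred):
    """Root mean squared error, in the units of the outcome."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mase(actual, pred, train):
    """Mean absolute scaled error: holdout MAE scaled by the in-sample MAE of a
    naive one-step-ahead forecast. Values below 1 beat the naive forecast."""
    actual, pred, train = (np.asarray(x, float) for x in (actual, pred, train))
    naive_mae = np.mean(np.abs(np.diff(train)))
    return float(np.mean(np.abs(actual - pred)) / naive_mae)

# Toy holdout check: weekly revenue actuals vs. a model's predictions.
train = np.array([980, 1010, 995, 1040, 1025, 1060, 1048, 1090.0])
actual = np.array([1105, 1120, 1098, 1150.0])
pred = np.array([1090, 1135, 1110, 1128.0])
print(f"MAPE {mape(actual, pred):.1f}%  sMAPE {smape(actual, pred):.1f}%  "
      f"RMSE {rmse(actual, pred):.0f}  MASE {mase(actual, pred, train):.2f}")
```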

What Is Needed

If MMM is going to remain commercially credible, the category needs to give customers a way to evaluate products quickly and easily. In practice, that means three things: public measurements, public comparison data, and public third-party evaluation.

First, the field needs a publicly agreed-upon set of measurements. The plural is important: there isn’t a single metric capable of capturing all aspects of MMM performance. But we do need a minimum reporting bundle that lets buyers compare models in a disciplined way: out-of-sample error, stability across re-estimation windows, calibration of uncertainty, sensitivity to modeling choices, and agreement with experimental results when experiments exist. And this needs to be a publicly agreed-upon definition, managed by a third party or a standards body. It’s catastrophic for the long-term health of a market if every vendor gets to define performance and accuracy differently.
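
To make that less abstract, here is one illustrative shape such a bundle could take. The field names, thresholds, and toy numbers are my own assumptions, not a proposed standard; the point is only that each item in the bundle is a concrete, computable quantity rather than a marketing claim.

```python
import numpy as np

def holdout_mape(actual, pred):
    """Out-of-sample error on a holdout window, as a percentage."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

def refit_stability(roi_by_window):
    """Coefficient of variation of a channel's estimated ROI across
    re-estimation windows; lower means the story changes less each refit."""
    roi = np.asarray(roi_by_window, float)
    return float(np.std(roi) / np.abs(np.mean(roi)))

def interval_coverage(actual, lower, upper):
    """Share of holdout points inside the model's stated credible interval.
    A well-calibrated 90% interval should cover roughly 90% of them."""
    actual, lower, upper = (np.asarray(x, float) for x in (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Toy numbers, purely to show the shape of a minimum reporting bundle.
bundle = {
    "holdout_mape_pct": holdout_mape([1105, 1120, 1098], [1090, 1135, 1110]),
    "search_roi_stability_cv": refit_stability([2.1, 2.4, 1.9, 2.2]),
    "interval_coverage_90": interval_coverage(
        [1105, 1120, 1098], [1020, 1050, 1010], [1180, 1200, 1175]),
    "lift_test_agreement": "within the geo-test confidence interval for 2 of 3 channels",
}
print(bundle)
```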

Second, the field needs publicly agreed-upon test datasets or, better still, public test-dataset generators. When vendors publish a case study using a hand-picked customer, the result is questionable. It might be correct, but how do we know? The fox is guarding the henhouse. We need a way to compare two MMMs, and that means we need a testbed. The usual answer here is to produce a static dataset, but dataset generators are better because they can create many scenarios with known ground truth: different carryover effects, seasonality, promotions, channel interactions, halo effects, missing impression data, and noisy or incomplete media histories. That makes it harder to overfit to a single benchmark corpus, and it makes the benchmark more relevant to the messy conditions under which MMM is actually used.
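
As a sketch of what a generator in that spirit might look like, consider something like the following. This is a toy, not siMMMulator or any vendor’s tool; the functional forms, parameter ranges, and names are arbitrary assumptions, and it only covers carryover, seasonality, and noise rather than the full list of scenarios above.

```python
import numpy as np

def adstock(spend, carryover):
    """Geometric adstock: this week's effective pressure is spend plus a
    decayed fraction of last week's pressure."""
    out = np.zeros_like(spend, dtype=float)
    for t, s in enumerate(spend):
        out[t] = s + (carryover * out[t - 1] if t > 0 else 0.0)
    return out

def generate_scenario(n_weeks=156, n_channels=3, seed=0):
    """Simulate one MMM scenario with known ground-truth channel effects."""
    rng = np.random.default_rng(seed)
    spend = rng.gamma(2.0, 40.0, size=(n_weeks, n_channels))
    true_roi = rng.uniform(0.5, 4.0, size=n_channels)    # revenue per unit of adstocked spend
    carryover = rng.uniform(0.1, 0.7, size=n_channels)
    media_effect = sum(true_roi[c] * adstock(spend[:, c], carryover[c])
                       for c in range(n_channels))
    season = 200 * np.sin(2 * np.pi * np.arange(n_weeks) / 52)
    revenue = 5_000 + season + media_effect + rng.normal(0, 250, n_weeks)
    return {"spend": spend, "revenue": revenue,
            "true_roi": true_roi, "true_carryover": carryover}

# A benchmark would hand (spend, revenue) to each model and score it on how
# closely the estimates recover the known truth across many scenarios.
scenario = generate_scenario(seed=42)
print("ground-truth ROI per channel:", np.round(scenario["true_roi"], 2))
```

A benchmark built this way can score any MMM, open or proprietary, by how closely its estimated ROIs and carryover parameters recover the known truth across many generated scenarios, including deliberately messy ones.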

Third, the field needs a way for independent parties to evaluate system performance. Without that, standards collapse into self-attestation. A credible benchmark regime needs neutral evaluators who can run the tests, inspect the assumptions, reproduce the reported performance, and publish results in a form that buyers can use. If no third party can meaningfully evaluate a vendor’s claims, then the buyer is still being asked to trust first and understand later.

This Is a Standard Approach

None of this is surprising. Defining performance and accuracy, arguing in public about standards, and building testbeds to compare competing technologies is how software markets evolve.

Take, for example, the relational database market. In the 1980s and early 1990s, relational database and transaction-processing vendors faced a version of the same problem that MMMs face today. Everyone wanted to claim superior performance, but vendors were often relying on ad hoc or selectively interpreted benchmarks. That made the market noisy and hard to evaluate. Buyers could hear claims, but they had little reason to trust that one vendor’s numbers were directly comparable to another’s. As Turing Award winner Jim Gray put it:

In the 1980s and early 1990s, database and transaction-processing vendors often relied on non-standardized benchmarks, producing the kind of “benchmark wars” that made performance claims difficult for buyers to compare; the TPC emerged in part to replace that noise with objective, more verifiable, and more comparable results.

The Transaction Processing Performance Council formalized workloads that had previously circulated in looser form, including the move from informal predecessors such as DebitCredit and TP1 to standards such as TPC-A and TPC-B. The resulting benchmarks defined the workload, how the tests had to be run, how prices were calculated, and how results were reported.

Once those rules existed, database buyers could compare claims in a common frame rather than squinting at vendor-specific demonstrations. Throughput and price/performance became comparable in a way they had not been before, and customers were able to make more informed decisions. The benchmarks did not eliminate gamesmanship, and they definitely did not eliminate overstated marketing claims, but they substantially improved the information environment.

MMM needs the same kind of test suite. It needs a public regime in which vendors can be compared on clearly specified tasks, under clearly specified rules, with enough disclosure that buyers can tell what is being measured and what is not.

Note also that public benchmarks do not make third-party validation obsolete. Independent verification against agreed-upon benchmarks helps keep everyone honest (to see why, and continuing with our database example, consider this article from Truth in Advertising).

In relational databases, for example, one of the companies I admire the most is Jepsen. Jepsen stress-tests systems against their stated guarantees (see, for example, their March 2026 report on MariaDB), publishes its methods, and is willing to share uncomfortable results. MMM does not need a literal Jepsen clone; I strongly doubt anyone is going to fault-inject an MMM the way Jepsen fault-injects distributed systems (though maybe using noisy data is a similar idea?). But it does need the equivalent institutional role: a respected outside party that pressure-tests vendor claims, documents its methods, and gives buyers a reason to trust something other than testimonials and case studies.

There Is Prior Art for This in Forecasting

Forecasting competitions are now so widespread that it is often forgotten how controversial they were when first held, and how influential they have been over the years.
Rob Hyndman

In conversation, people often object to my use of databases as an example of what the MMM community needs to do. They point out that MMM inputs are much messier than database transactions. Media data are incomplete, channels interact, promotions distort baselines, impression data is often missing, and counterfactuals are never observed directly. The implication is that database-style benchmarks may be possible for databases, but not for MMMs.

That seems like a weak objection to me. MMM is, at bottom, a forecasting technology: it uses historical signals to estimate how outcomes change across time and under alternative allocations. And forecasting has had decades of prior art on how to compare methods under uncertainty.

For example, consider the Makridakis competitions, which began with the 1982 forecasting competition and later evolved into the M competitions (most recently, M5 and M6). Rather than settling disputes by theory or vendor narration, these competitions published datasets and evaluation procedures, and required contestants to compete on the same problems. The competitions, and the resulting writeups, drove a significant amount of progress in forecasting methodology (in the words of Rob Hyndman, they had a “profound effect on forecasting research”).

Forecasting competitions also make accuracy metrics explicit. Participants could disagree about which metrics mattered most, but the disagreement was public and tractable because the scoring rules were declared in advance and results were reported side by side. That is a dramatically healthier setup than one where each vendor picks whichever outcome measure makes a case study look strongest.

Forecasting also shows that messy, real-world data do not disqualify benchmarking. The M competitions and their descendants span many horizons, domains, and data structures, from relatively small collections of business time series to massive retail forecasting problems with hierarchical structure. In other words, the field did not wait for a perfectly neat toy problem. It embraced heterogeneity and built evaluation frameworks around it.

So the claim that MMM is too complex to benchmark is not a serious objection. At most, it is an argument that MMM benchmarks should be richer than a single leaderboard. Which is fine. Let’s build multiple benchmark families, publish the generators and scoring rules, and let third parties run the tests. The forecasting community has even provided a template for how to design competitions.

There Is Even Prior Art for This in MMMs

It’s important to note that the idea of a validation framework, and the idea of comparing MMMs using synthetic datasets, are not new. There are several starting points already in place that could be used to build toward the vision outlined here.

Perhaps the most important comes from Mutinex, which has published the Open MMM Validation Framework. Mutinex describes this as a vendor-neutral, open-source toolkit for objectively testing and comparing MMM solutions, including open and proprietary models, and says the framework is intended to make MMM comparisons “apples to apples” and to reduce reliance on sales decks rather than evidence. Similarly, Meta’s siMMMulator could be viewed as an early implementation of the “public test-dataset generator” idea. It is an open-source project that generates simulated MMM datasets with known ground-truth ROI so that users can validate models, compare MMMs across generated business scenarios, and quantify whether a modeling innovation improves accuracy.

Both of these projects are important building blocks that help illustrate that the core problem is both approachable and solvable. But a lot more work is necessary, both on the technology side and in creating cross-industry acceptance and adoption.

Earlier, I also mentioned that we need “a minimum reporting bundle that lets buyers compare models in a disciplined way.” Both Recast and Stella have published some important articles on that front.

Restatement and Conclusions

I’m actually a big fan of MMMs. They’re an elegant idea with a long history, and they can be incredibly useful when solving thorny problems around marketing attribution and incrementality. But the commercial products are too hard to evaluate, and too hard to compare. When buyers face dozens of vendors, idiosyncratic claims, long implementation cycles, and case studies that rarely discuss model accuracy, trust in the category eventually erodes. And once that happens, everyone suffers. Even the good vendors will get dragged down with the bad.

Fortunately, we have prior art: many other software categories have dealt with this problem. MMM needs public measurements, public benchmark datasets or generators, and credible third-party evaluations. Databases have a well-documented history and plenty of examples of how third-party evaluators can thrive alongside a robust set of products. Forecasting competitions have already shown that complex predictive systems can be compared and evaluated in public, for everyone to see. And individual companies within the MMM ecosystem have taken the first steps towards establishing cross-vendor comparison frameworks. If the MMM ecosystem takes those lessons seriously, the category can become more legible, more robust, and more valuable.

If it does not, my bet would be that adoption of commercial products will slow and then stall, and that MMM vendors will be replaced by in-house products built on top of open-source frameworks.

