Data Clean Rooms: Feature, Product or Platform?
Jonathan Mendez on this emerging data collaboration technology
Introduction
Data clean rooms have been in the news again following LiveRamp’s surprise acquisition of Habu for $200m last week. But what exactly is a data clean room? In this long-form piece, written by Jonathan Mendez, we explore the rise of data clean rooms, their capabilities, how they interoperate with CDPs and what the major cloud providers are doing to provide clean room capabilities.
It’s a delight having Jonathan Mendez back on my Substack after his first piece, on CTOs and the Composable CDP. I can’t think of anybody better to tackle this topic - Jon has grappled with questions of customer data, identity, federation and collaboration throughout his career in media and adtech. Most recently he founded Neuralift AI, a first-party data application that tells brands how to lift their Marketing KPIs.
Whether you are a clean room expert or just looking to get oriented in this space, there should be plenty to learn below! Let’s get started.
An accelerating pace of change
If you’re looking for the fastest-changing technology markets of the past decade, adtech and martech are at the top of the list.
Now that these two different worlds are merging, the pace of change is only accelerating. This rate of change revolves around first-party (1P) data - data that companies collect and own through their direct relationships with customers and the privacy policies that govern these relationships.
There is no better example of how fast this market is changing than data clean rooms:
IDC Prediction: By 2024, 65% of G2000 Enterprises Will Form Data-Sharing Partnerships with External Stakeholders via Data Clean Rooms to Increase Interdependence While Safeguarding Data Privacy and Precious Data Assets
What are data clean rooms?
Data clean rooms are increasingly becoming the linchpin of 1P data collaboration. They are a natural progression from the technical requirements and use cases that emerged from customer data platforms (CDPs) around five years ago.
Where CDPs focused on unifying customer profiles and associated data from disparate vendors and tools under a single ID, data clean rooms focus on sharing that unified data between organizations in a privacy-compliant manner. Ultimately, clean rooms extend the value of CDPs.
So it makes sense that in 2019, just as CDP vendors were gaining traction and funding, a new set of vendors emerged focusing on data privacy, security and collaboration. Habu and InfoSum both came to market during this period with what quickly became known in MarTech as the very appealing-sounding ‘Clean Room’.
The early days of clean rooms
Data clean rooms provide a way to run matching overlays and use first-party data, similar to what data management platforms (DMPs) have done and continue to do. In fact, the team that founded Habu came from the market-leading DMP Krux (acquired by Salesforce in 2016).
The early adopters of data clean rooms were those companies that depend on data collaboration for revenue - namely advertising, marketing, media and data companies - and who are grappling with data privacy in the face of GDPR, CCPA and a slew of new data protection laws.
Data clean rooms were out-of-the-box solutions that injected synthetic data into the tables and obfuscated personally identifiable information (PII). This let early adopters immediately capitalize on the newfound value of their first-party data through collaboration.
Nothing excites VCs more than a new SaaS vertical and the race was on!
For the early clean room products, privacy and collaboration were seemingly not enough. The features that Habu and InfoSum built on top of the clean room sounded a lot like CDPs: features like identity, attribution, segmentation and activation.
The talk around clean rooms and CDPs has become interesting and confusing. This is because the cloud platforms used for the underlying clean room systems (including AWS, Databricks and Snowflake) are not only proponents of unbundled or composable CDPs, but are also launching their own clean room offerings.
On the flip side we can expect CDPs to start using synthetic data and prioritizing privacy-centric features, to preserve market share in an increasingly crowded - and confusing - market.
It is a legitimate question to ask: “Do I need a CDP or do I need a clean room?”
Key clean room use cases
Before we go further let’s look at the core use cases for clean rooms.
Audience Matching/Overlap
This is the primary use case for media companies. Advertisers and agencies have their own datasets of audience profiles. These are overlaid against the media company’s audience data, and the matched profiles are activated as segments/audiences for ad matching/delivery. This also allows for the understanding of incrementality and frequency management.
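To make the overlay idea concrete, here is a minimal Python sketch. It assumes both parties key their audiences on email addresses, and it uses plain SHA-256 hashing and a tiny minimum audience size purely for illustration. The function names and the threshold are hypothetical, and this is not how any particular vendor implements matching.

```python
import hashlib

# Hypothetical threshold; a real clean room would typically enforce a much larger one.
MIN_AUDIENCE_SIZE = 3


def hash_email(email: str) -> str:
    """Normalize an email address and return its SHA-256 hex digest."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def overlap_count(advertiser_emails, publisher_emails, min_size=MIN_AUDIENCE_SIZE):
    """Return the size of the matched audience, or None if it is below min_size."""
    advertiser_ids = {hash_email(e) for e in advertiser_emails}
    publisher_ids = {hash_email(e) for e in publisher_emails}
    matched = len(advertiser_ids & publisher_ids)
    # Suppress small results so that individuals cannot be singled out.
    return matched if matched >= min_size else None


advertiser = ["ann@example.com", "Bob@Example.com ", "cy@example.com", "di@example.com"]
publisher = ["bob@example.com", "cy@example.com", "di@example.com", "eve@example.com"]

print(overlap_count(advertiser, publisher))      # 3
print(overlap_count(advertiser[:2], publisher))  # None (only one match, so suppressed)
```

Neither side sees the other's raw list here, only a count, and the count itself is withheld when the matched audience is too small.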
Enrichment
This is the old second- and third-party data that is now no longer collected in the DMP, but in the clean room. Of course, this is now all magically ‘first-party data’! You can get enrichments like household income, DMAs, ‘in-market for X’ attributes and so on to help you expand your audience addressability.
Attribution/Journey
Campaigns need back-end performance data. For example, a CPG brand advertising on an RMN (Retail Media Network) could get conversion data for their SKUs, which is matched with impression and click data. An auto OEM could see how their ad campaigns are driving purchases at dealerships. You can also add 1P behavioral data into the mix and create a journey analysis.
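The matching described here can be sketched in a few lines of Python. The rows, field names and the `attribute` function below are all hypothetical, and the logic is deliberately naive: it credits any conversion by an exposed user to the campaign and ignores timing, lookback windows and incrementality.

```python
from collections import defaultdict

# Hypothetical rows: the media owner's exposure log and the retailer's sales log,
# both keyed on the same pseudonymous user ID.
impressions = [
    {"user": "u1", "campaign": "spring"},
    {"user": "u2", "campaign": "spring"},
    {"user": "u2", "campaign": "spring"},
    {"user": "u3", "campaign": "summer"},
]
conversions = [
    {"user": "u2", "sku": "SKU-1", "revenue": 4.50},
    {"user": "u3", "sku": "SKU-2", "revenue": 3.00},
    {"user": "u4", "sku": "SKU-1", "revenue": 4.50},  # never exposed to an ad
]


def attribute(impressions, conversions):
    """Return per-campaign counts of exposed users, converting users and revenue."""
    exposed = defaultdict(set)
    for row in impressions:
        exposed[row["campaign"]].add(row["user"])

    report = {}
    for campaign, users in exposed.items():
        matched = [c for c in conversions if c["user"] in users]
        report[campaign] = {
            "exposed_users": len(users),
            "converting_users": len({c["user"] for c in matched}),
            "revenue": sum(c["revenue"] for c in matched),
        }
    return report


for campaign, stats in attribute(impressions, conversions).items():
    print(campaign, stats)
```

The output contains only campaign-level aggregates, which is the form in which a clean room would usually hand results back to the advertiser.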
Use cases are underpinned by identity resolution
These cover virtually all use cases for data in marketing/advertising, except for one large need necessitated by first-party data: identity resolution.
Not to be outdone by the CDPs who came to market with Identity / 360° view of the customer as their primary use case (and rightly so!), clean rooms can also resolve identity.
This is from the Habu site:
In this way, clean rooms are more flexible than CDPs. They can work with an ID graph provider or create the graph for you. This is where clean room vendors like Habu and InfoSum start to feel more like a holistic data platform.
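For readers who have not seen one built, an identity graph can be sketched as a clustering problem: any two records that share an identifier belong to the same person. The Python below is a minimal illustration using union-find. The identifier formats and the `build_identity_graph` function are hypothetical, and real ID graphs add probabilistic matching, confidence scores and consent checks that are left out here.

```python
def build_identity_graph(records):
    """Group identifiers that co-occur in any record, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for identifiers in records:
        if not identifiers:
            continue
        first = identifiers[0]
        find(first)  # register records that hold a single identifier
        for other in identifiers[1:]:
            union(first, other)

    clusters = {}
    for identifier in parent:
        clusters.setdefault(find(identifier), set()).add(identifier)
    return list(clusters.values())


# Hypothetical observations: each record lists identifiers seen together.
records = [
    ["email:ann@example.com", "cookie:abc"],
    ["cookie:abc", "device:ios-1"],
    ["email:bob@example.com", "device:android-9"],
]

for cluster in build_identity_graph(records):
    print(sorted(cluster))
```

The first two records are linked through the shared cookie, so three identifiers collapse into one profile while the third record stays separate.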
In fact, the clean room use cases themselves mirror many of the use cases of CDPs (which are by definition platforms). This is, of course, exactly what the clean room vendors knew would happen, given that they are starting from the most important use case for any customer-facing business: data privacy.
Data protection fines are increasingly prevalent worldwide. In 2024, I expect to see an increasing number of US states starting to issue fines, and I suspect that the highest fines will rise from nine figures these past few years to ten figures in the coming years.
Public clouds’ clean room offerings
But let’s get back to the cloud platforms, because it was the release of Amazon’s Clean Room service at re:Invent in 2022 that first got me asking, “what exactly is a clean room?”
Undeniably, we’ve watched enterprises embrace the cloud over the past five years. Worldwide end-user spending on public cloud services totaled $563.6 billion in 2023 [1]. Azure in particular has made great inroads and Microsoft is well positioned in the enterprise space.
However, if Amazon and Google understand anything, it’s that digital is about scale and the long tail.
There are close to 2 million active sellers on Amazon.com and while Google doesn’t disclose numbers, it is believed it has over 4 million advertisers and 2 million publishers on its search and display ad networks. Massive!
Smaller workloads with higher unit costs for compute are an enticing model for cloud growth. It’s no accident that AWS and GCP are moving closer to the customer, using advertising and marketing as the wedge.
They know that the advertising and marketing use cases that have come to the fore will become increasingly important in cloud vendor selection over the next few years.
With Google and Amazon already the dominant first-party ad platforms, it makes sense that clean rooms have been rolled out as a feature of Google Ads Data Hub and Amazon Marketing Cloud. The fact that these companies have the two largest and richest consumer ID graphs to draw on further enhances this opportunity.
Google Clean Room
Ads Data Hub is Google’s version of a clean room. Amazingly, it was launched almost seven years ago for mobile measurement of YouTube ads.
The original use case when it was introduced was view-through conversions from YouTube and GDN. Advertisers connected their transactional data in BigQuery via the Ads Data Hub API to a Google-run BigQuery instance that held all the campaign data from Google. While this came to market as a “Hub”, it was really just an API.
In October 2022, Google decided that the use cases for measurement and marketing were differentiated enough to split Ads Data Hub into two distinct solutions: Ads Data Hub for Marketers and Ads Data Hub for Measurement.
The main use case for marketers is audience matching and overlap with something called PAIR (Publisher Advertiser Identity Reconciliation).
Of course, you can easily activate these cohorts in Google’s ad universe and especially in YouTube, which Google still prevents DSPs from buying directly (even after an EU antitrust settlement). This reluctance becomes a little more understandable when you see how Ads Data Hub is coming to fruition in Google Cloud’s most important strategic growth area - BigQuery.
Ads Data Hub for Measurement seems like a bone thrown out to all the data dogs, namely agencies and consultancies. Specifically, it enables YouTube to be measured alongside other CTV where its outsized performance will lead to more YT ad spend.
It’s interesting that Google felt it necessary to split Ads Data Hub in this way. The feedback I’ve heard from people who have used Ads Data Hub historically has not been positive. It’s clear that the product has some issues yet to be resolved. But there is always time for Google to improve and take market share. Witness how, over the course of a decade, they used their scale and overwhelming demand to turn DFP/DFA (DoubleClick for Publishers/Advertisers) into the dominant display advertising marketplace.
With martech and adtech converging, the coming ‘Big Kahuna’ for Google Ads Data Hub will likely be the first-party data generated by Google Analytics 4 and landed into BigQuery. There are 5 million companies using Google Analytics - a vast footprint that can start to leverage Ads Data Hub Clean Room.
Amazon Clean Room
Amazon does not have a site/app analytics tool to collect event data. Therefore it makes sense for them to take a ‘top-down’ strategy, versus a GCP-style ‘bottom-up’ strategy.
In early December 2022, Amazon announced a suite of services collectively called Amazon Marketing Cloud. The suite contains five different solutions:
ID Resolution
Clean Room/Data Collab
Measurement/Insights
Personalization
Ad Activation
Clean Room was of course a key part of this solution, and overall I have to applaud AWS’s strategy and their go-to-market positioning. In fact…
… “clean rooms in minutes” is a pretty compelling go-to-market!
Here's the AWS architecture:
So a question emerges. If a clean room is really an API that sits between two lakes and pipes into an orchestrator and a reporter, then isn’t this just one feature of the data lake, similar to a service like monitoring?
I don’t want to diminish the value of a clean room – in fact it may be more valuable in this architecture than as its own product or platform. As you can see above, it serves as glue between databases.
But there’s no denying that the major cloud providers are trying to offer these features. And in fact, it’s not just AWS and GCP.
Data platforms’ clean room offerings
Databricks Clean Room
Databricks unveiled its Clean Room for Lakehouse in June 2022. In conjunction with Delta Lake and as part of its Unity Catalog Governance (which itself is based on the open standard ANSI SQL), Databricks Clean Room is clearly a feature of the Delta Sharing solution.
Databricks makes no secret in its positioning that this feature replaces clean room vendor lock-in and is interoperable with other cloud providers outside of Databricks.
Databricks’s case against clean room vendors is three-pronged:
Data movement and replication is difficult
Vendors are restricted to SQL
Collaboration is limited to two participants at a time
Snowflake Clean Room
For three years, Snowflake has positioned its Clean Room service as a build-it-yourself offering. In 2022, it re-emerged as a “Global Clean Room”, marketed for media, entertainment and advertising use cases. It is pitched as a framework because it now requires the Snowflake Native Application Framework.
In order to collaborate on data in a clean room environment with Snowflake, both parties require a Snowflake instance deployed in the same cloud and region. As a result, to use Snowflake’s Data Clean Room capabilities, the data must first be replicated to the provider’s account in the desired cloud regions.
Snowflake Clean Room queries are not written as free-form SQL; instead, they are run through predefined Jinja templates. The following templates are available:
Audience Overlap and Segment Creation
Customer Enrichment
Campaign Conversion Measurement
Lookalike Segments
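The general idea behind template-only querying can be sketched in Python. The example below is an illustration of the pattern, not Snowflake's implementation or API: it uses SQLite and a fixed, parameterized SQL string in place of a Jinja template, and the table names, template name and `run_template` function are all hypothetical.

```python
import sqlite3

# Hypothetical approved templates. Collaborators choose a template and supply
# parameters; they cannot submit arbitrary SQL.
TEMPLATES = {
    "audience_overlap": (
        "SELECT COUNT(DISTINCT a.user_id) "
        "FROM advertiser_crm AS a "
        "JOIN publisher_audience AS p ON p.user_id = a.user_id "
        "WHERE p.segment = :segment"
    ),
}


def run_template(conn, name, params):
    """Run an approved template by name; reject anything not on the list."""
    if name not in TEMPLATES:
        raise ValueError(f"template {name!r} is not approved")
    return conn.execute(TEMPLATES[name], params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE advertiser_crm (user_id TEXT);
    CREATE TABLE publisher_audience (user_id TEXT, segment TEXT);
    INSERT INTO advertiser_crm VALUES ('u1'), ('u2'), ('u3');
    INSERT INTO publisher_audience VALUES
        ('u2', 'sports'), ('u3', 'sports'), ('u3', 'news'), ('u4', 'sports');
    """
)

print(run_template(conn, "audience_overlap", {"segment": "sports"}))  # 2
```

Because the only entry point is a named template, the data owner decides in advance which questions can be asked of the joined data, and a collaborator can only vary the parameters.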
Snowflake’s proposition here looks set to get even stronger with their acquisition of standalone clean room vendor Samooha in December 2023.
The customers of clean rooms
Now that we’ve taken a look at the various data clean room solutions in the market and how they fit into the broader architecture, one thing is for certain - the use cases that have emerged are pervasive and commoditizing. So the differentiating factor between a feature, product or platform is not “what” but rather “who”.
So who needs a clean room? Maybe that will tell us whether they are a feature, a product or a platform. A brand, a media company, an agency, an ad network - they all have different flavors of their business and sit at different parts of the data collaboration chain. Now that we’ve looked at clean rooms from the technology and use case lenses, let's have a look at the collaborator lens as well to get a complete view of this market.
Media companies
As we’ve seen, the large media platforms (Google and Amazon) have some flavor of clean room.
We’re also seeing media companies like NBCU, Disney and Roku build clean rooms. In effect, they are creating their own walled gardens and using their own IDs to create aggregate level reporting.
The issue here is interoperability and cross-channel measurement, which is difficult to achieve. Vinny Rinaldi, Head of Media & Analytics at Hershey’s, tweeted about the growing issue of more and more walled gardens emerging in relation to Retail Media Networks (all of which use some form of clean room for audience matching).
If you are a media company, the clean room is definitely a platform. You have multiple parties bringing their disparate data sets into your environment, where the data is aggregated, and activated, and where the groundwork for measuring and reporting media performance takes place. And as discussed, we can expect clean rooms to replace DMPs and publisher CDPs over time. Disney and NBCU appear to be the early adopters.
Data companies
Not to be forgotten in all this analysis are the data companies themselves. They also have clean rooms in which all use cases can be handled for a brand.
Most prominent among these is LiveRamp. LiveRamp started with their own in-house clean room technology called Safe Haven, and have now dramatically augmented their clean room offering with last week’s acquisition of Habu.
TransUnion has a clean room called TruAudience with its own identity graph, plus enrichment. Epsilon’s clean room is called “People Cloud Prospect” (I am not making that name up) - they have taken their 200 million CORE IDs and integrated with measurement partners like IRI.
The other big data players like Acxiom and Merkle all have similar clean room solutions. These are typically used by brands and agencies for enrichment, so I think it’s safe to call these products.
Brands
The owned-and-operated (O&O) clean room model (you could also call it the pure-play model) is interesting for brands because of their direct 1P relationship with consumers.
If you want to have your own clean room to solve the use cases noted multiple times above, this is your bucket. In these instances, you own your own collaboration settings and rules, and maintain full control over the data once those rules are set.
What’s interesting here is how identity graphs can become part of the core offering. As we can see, identity is always at the heart of clean rooms because the fundamental use case of clean rooms is privacy.
This also opens the door for CDP vendors to offer clean room capabilities, given that the main use case of a CDP is a unified customer identity across all behaviors and touchpoints.
In conclusion: open questions and challenges
The answer to the original question “Are clean rooms features, products or platforms?” is of course “all of the above.” It really depends on who you are in the advertising and marketing value chain and what you are trying to accomplish with data collaboration.
While clean rooms are a big step forward for data collaboration and the convergence of adtech and martech, they also create new problems that still need to be solved. Measurements that are already difficult become even more challenging, and clean rooms create additional work for people and additional compute costs for businesses.
Ultimately, as with all other data privacy enhancements, the people with the most data will benefit the most. It's easy to see how Amazon and Google can grow their cloud businesses by providing cloud-based solutions to these problems.
It’s not as easy to predict how the marketing SaaS landscape will evolve. Products and services need to be built on top of the data warehouse, at the collection layer or on the activation edge to derive greater value from data joins – particularly if these products and services are to be accessible to data science. The recent clean room vendor acquisitions by Snowflake and LiveRamp offer encouraging pathways towards reducing friction with more integrated tooling.
The most exciting thing is that we are on the cusp of a new era of data collaboration thanks to clean rooms. I wrote about this a few years ago and we are just beginning to see the benefits of bringing data together. The move to cloud is likely to usher in a seismic shift in the way data is used and the benefits it brings to advertising and marketing.
One thing is for certain: we’ll never go back to third-party data playing such a prominent role in media, advertising and marketing.
First-party data is the foundation for understanding your customers and audience. More specifically, behavioral data (as generated by Snowplow!) is the most precious of all data for understanding the economics of your business. The faster you become an expert in operationalizing it to achieve your business goals, the more competitive advantage you will gain. Since data is a motorsport, keep your data engines clean!
About the author
Jonathan Mendez has been a Founder/CEO in eCommerce, martech and adtech. His early work at Offermatica led to the company being acquired by Omniture and becoming Test & Target and then Adobe Target. He then founded Yieldbot and used first-party behavioral event data to become the second-fastest-growing technology company in North America from 2012-2016 (Deloitte Fast500). He has been writing about martech and adtech for over 15 years on his blog Optimize and Prophesize. Jon’s new startup is Neuralift AI and it’s his best one yet.
[1] https://www.gartner.com/en/newsroom/press-releases/11-13-2023-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-reach-679-billion-in-20240
The Databricks Clean Room supports both SQL and Python.