This is the first post in a series sharing the Snowplow manifesto around behavioral data: how to generate it, govern it and leverage it.
Every long-form post these days has to start with some first-person narrative to prove that the author isn't ChatGPT. So here goes: recently I’ve been mulling the future of event tracking, which has been in the news again this week with the launch of Hightouch Events.
Event tracking is a riddle of a technology. It defies attempts to turn it into a feature (more on this later), but unlike ETL, it often isn't treated as its own category. It's an "I work in a café" technology - "oh that's cool, what else do you do?". (Nobody embodies this parental disappointment better than Segment: they went from an open-source tag manager to a ‘Customer Data Infrastructure’ to a fully-paid-up CDP.)
Yali and I started Snowplow back in 2012 as a simple experiment: what would happen if brands could download an approximation of the engine underlying Google Analytics, run it native in their own AWS account, and get direct access in an S3 bucket to every single generated event? What insights unique to their business would this unlock? What cool things would they build?
In 2012 there were already plenty of multi-tenant SaaS tools which saw event tracking as a means to an end: they included web or mobile tracking as a feature to help power their analytics dashboards, marketing activation, fraud detection or whatever. Snowplow flipped the script and said that event tracking was the end in itself: that this behavioral (née clickstream) data was incredibly valuable for a whole host of use cases and shouldn’t be locked into a specific tool.
Our vision has taken a full decade to be realized, primarily because it has taken that long for the price-performance ratios of cloud data warehouses and lakehouses to really fly. I covered this technology trend in my launch post, Why martech is interesting again. But in the intervening 10 years, brands continued to adopt point solutions which insisted on their own event tracking.
On lock-in: a tale of two techs
This lock-in point is real because event tracking is sticky. Incredibly sticky. Event tracking is embedded throughout a brand's digital estate - websites, mobile apps, digital storefronts, games. At Snowplow we regularly talk to businesses that couldn't change their event tracking if they wanted to - think micro-sites built by long-departed agencies; triple-A games downloaded onto players' consoles; IoT sensors distributed around airports.[1]
To understand this lock-in, understand the nature of event tracking. Event tracking is often called “data collection”, but that is misleading: without deliberate instrumentation by your developers of your customers’ digital behavior into track(event) function calls, there is no data to collect. The process of event tracking deliberately conjures valuable data into existence which wouldn’t exist otherwise.
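To make the point about instrumentation concrete, here is a minimal sketch in Python. The `Tracker` class, its in-memory buffer and the `add_to_basket` function are all hypothetical names invented for illustration; this is not Snowplow's SDK or any vendor's API. It only shows that the event exists because a developer deliberately wrote a `track(...)` call next to the business logic.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Tracker:
    """Hypothetical tracker: buffers events in memory instead of sending them."""
    app_id: str
    buffer: list[dict[str, Any]] = field(default_factory=list)

    def track(self, event: dict[str, Any]) -> None:
        # Stamp each event with its origin and collection time.
        self.buffer.append({
            "app_id": self.app_id,
            "collected_at": datetime.now(timezone.utc).isoformat(),
            **event,
        })


def add_to_basket(tracker: Tracker, sku: str, quantity: int) -> None:
    # ... the real business logic (updating the basket) would go here ...
    # The behavior only becomes data because a developer wrote this line.
    tracker.track({"name": "add_to_basket", "sku": sku, "quantity": quantity})


tracker = Tracker(app_id="web-shop")
add_to_basket(tracker, sku="SKU-123", quantity=2)
print(tracker.buffer[0]["name"], tracker.buffer[0]["sku"])
```

Delete the `tracker.track(...)` line and the basket still works, but there is nothing left to "collect". That is also why swapping trackers is hard: calls like this one end up scattered through every codebase in the digital estate.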
Compare this to another data software category - data extraction, the front-end of ELT or ETL. ETL tools are the Uber drivers of the Modern Data Stack: paid to drive the data from A (the source system) to B (your data warehouse) while expressing as little opinion or judgment as possible. Companies can and do rip-and-replace ETL tools on a regular basis - for a variety of reasons, not least cost or source connector coverage.
So we have a conundrum. To software vendors, event tracking is a great source of additional data to help feed their algorithms, and a powerful vector to lock their customers in to their tool. But to savvy buyers, event tracking is like a strategic mountain pass into their territory: if they don't control it, they know from bitter past experience that they are going to have problems in the future.
Isn’t this what tag management was meant to solve?
Tag management is dying, slowly
Why do I need a primary tag at all? Why can’t I just embed a tag manager like Google Tag Manager or Tealium into my digital estate, and then invoke all my different tags through that?
Client-side tag management was a dominant force for years, but is now dying a slow death. Slow, because event tracking is incredibly sticky! But it’s on its way out nonetheless, and for three reasons:
Lack of governance & accountability
Customer data leakage
Performance issues
Let’s take a look at each of these in turn.
Lack of governance & accountability
If there is no primary tag in your client environments, there is no single author of your event tracking. Data discrepancies between different destinations become impossible to reconcile. Where did the error get introduced? Whose fault is it? There is no central source of truth for the original behavior being observed; instead there will be multiple independent, incomplete records of that truth.
This is the diffusion of responsibility problem, aka the bystander effect: when 10 tags are in the client to observe digital behavior, nobody is observing digital behavior. Or, to paraphrase Henry Kissinger on Europe, “who do I call if I want to speak to the tag manager?”.
Customer data leakage
When a client-side tag manager invokes tags on a website, arbitrary JavaScript code from a set of third-party vendors is executed inside the context of the webpage. Over the years I have struggled to explain how completely insane this is.
The best analogy I can come up with is: you go into a bank to withdraw $5,000. Before the teller gives you your money, she calls up five external companies and asks them for additional instructions to follow. The bank does not see these additional instructions, does not approve them and cannot veto them. The only real protection you have: if they do anything noticeably awful, the teller will stop phoning them.
This is even more jarring when you compare this to the great processes and tools (e.g. Immuta) being implemented by data teams to lock down access to your customer data once it’s in Snowflake, Databricks or similar. When I participated in the Packaged vs Composable CDPs debate back in spring, it was striking that every participant agreed that client-side tag management had been a huge wrong turn for marketers.
Performance issues
Less exciting, but also important: client-side tag management is very costly from the perspective of network connectivity, bandwidth and so on. Performance issues are real: website conversion rates drop by an average of 4.42% with each additional second of load time between seconds 0-5.
As more and more web traffic shifts over to mobile, these performance issues have become more acute; they are also one of the reasons why tag management inside of mobile apps never took off to the same extent as on web.
Put these three trends together, and:
There can only be one (primary tag)
You need a single technology serving as the primary event tracker - the primary tag - across your digital estate: websites, mobile apps, games, IoT devices or whatever. This technology has to:
Observe all potentially meaningful behavior occurring across the digital estate
Govern all downstream access to the behavioral data stream, to maintain accountability, preserve consumer privacy and preserve data lineage
Allow other vendors to access selected subsets of the behavioral data in a performant and compliant manner
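The three requirements above can be sketched as a single point of observation with a per-vendor access policy. This is an illustrative toy, not how Snowplow or any real product is implemented: the `PrimaryTag` class, the `POLICIES` table, the vendor names and the field names are all assumptions made up for the example.

```python
from typing import Any

Event = dict[str, Any]

# Hypothetical policy: which events and which fields each vendor may receive.
POLICIES: dict[str, dict[str, set[str]]] = {
    "marketing_tool": {"events": {"purchase"}, "fields": {"name", "order_value"}},
    "fraud_tool": {"events": {"purchase", "login"}, "fields": {"name", "user_id", "ip"}},
}


class PrimaryTag:
    """Toy primary tag: observes everything, releases only governed subsets."""

    def __init__(self, policies: dict[str, dict[str, set[str]]]) -> None:
        self.policies = policies
        self.log: list[Event] = []  # the single, complete record of behavior
        self.outbox: dict[str, list[Event]] = {vendor: [] for vendor in policies}

    def track(self, event: Event) -> None:
        # 1. Observe: every event lands in the complete log.
        self.log.append(event)
        # 2. Govern and 3. share: each vendor gets only its allowed events and fields.
        for vendor, policy in self.policies.items():
            if event["name"] in policy["events"]:
                subset = {k: v for k, v in event.items() if k in policy["fields"]}
                self.outbox[vendor].append(subset)


tag = PrimaryTag(POLICIES)
tag.track({"name": "purchase", "user_id": "u1", "ip": "203.0.113.7", "order_value": 42.0})
tag.track({"name": "login", "user_id": "u1", "ip": "203.0.113.7"})
print(tag.outbox["marketing_tool"])
```

In this sketch the marketing tool never sees the user's IP address and the fraud tool never sees order values, yet both work from the same observed events. When two destinations disagree, the complete log is the one place to check.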
Of course, with great power comes great responsibility - you need to be incredibly intentional about who you decide to make the primary tag!
Competing visions
There are plenty of vendors out there that would love to be your primary tag - think of the lock-in! Let’s address a handful of competing scenarios:
A digital analytics tool wants to be your primary tag?
Honestly this is fairly common - digital analytics tools depend on some basic behavioral data flowing into them, so all of them ship with their own event tracking SDKs. Some of these tools then expose that event data in the data warehouse, like Google Analytics and Amplitude do. Amplitude also lets you stream its events downstream to other tools.
The delivery of this event data to data warehouses has muddied the water somewhat. How much lock-in does Google Analytics really have, if all of the data is being synced to BigQuery? In fact, still a ton: they own the primary tag; they transform the data in a black-box way they control; they land the data in a format and schedule they dictate. They also mandate what you can and can't track - for example, you are forbidden from sending personally-identifiable information into GA. It's the illusion of freedom you get with airmiles - you imagine endless possibilities, but the loyalty store will only send you a branded Yeti cup.
Amplitude is another interesting example. They are known for their robust product analytics offering - but, leveraging their event tracking, they have also now added their own CDP. A savvy customer will evaluate these additional offerings on their own merits - but if Amplitude is your primary tag, then it's going to be far harder to swap out these components for best-of-breed alternatives.
A CDP wants to be your primary tag?
Customer Data Platforms have a long and unsatisfying history of trying to featurize event tracking. The fundamental category error is that CDPs are designed for martech or growth teams, not for analytics teams. Analytics teams are trying to understand human behavior in all its glory - behavior that is incredibly rich, layered and heterogeneous; they need to join that behavioral data to wider data to provide enterprise-wide insight beyond just marketing into high-impact areas such as pricing, supply chain, fraud and compliance.
CDPs evolved on a different tech tree - if they do event tracking at all, it's just enough to serve the main course, which is marketing activation. CDP-originated tracking protocols like Segment's are far too low-fi and lossy to serve as a primary tag for capturing the richness of human behavior online.
If a CDP vendor needs data in real-time, for example for marketing triggers, this can be relayed from your primary tag server-side. The vendor can pick up the rest of the event data from the data warehouse using a composable CDP or dual-zone CDP architecture.
A cloud or database vendor wants to be your primary tag?
Run, don't walk! This is a coupling of very different layers of your infrastructure. This is such an odd concept that ChatGPT couldn't even find me a good analogy from a different industry.
Simply put: you must always be able to swap out lower infrastructural layers like your public cloud provider or your cloud data warehouse without impacting your higher customer data and marketing layers. Event tracking lock-in can't be allowed to block a company from making 8- or 9-figure decisions about cloud and data infrastructure selection.
Towards Switzerland: non-negotiables for your primary tag
We've established that your primary tag – your primary event tracker – needs to be Switzerland, but what does this really mean? Let’s set out the seven key constraints which your primary tag must operate under:
Your primary tag doesn't need to sell you something else. Beware the ‘razor-and-blades’ strategy where the primary tag is given away for free, to upsell you into downstream products like a CDP. Your primary tag has to stand on its own two feet as a purchasable product, as Snowplow BDP does.
Your primary tag is agnostic of processing infrastructure. Whatever cloud you use, whatever database you target, you need to know that your primary tag will go there. Otherwise the tail will wag the dog: infrastructure choices will be determined by your primary tag.
Your primary tag doesn't disadvantage other vendors in other categories. Easy to say, difficult to stick to. Primary tags often start as Switzerland but then the temptation of controlling the mountain pass is too great; competing vendors start to get frozen out.[2] Snowplow is the outlier – we have been sticking to our knitting and not getting in other vendors’ way for a whole decade.
Your primary tag supports multiple distribution and pricing models. An important corollary to the lock-in – you have to be able to buy your primary tag in the way you want. Choose from different distribution models, like multi-tenant SaaS, private managed cloud or downloadable. Choose from different pricing models, not just event volumes or monthly tracked users.
Your primary tag minimizes data duplication. The core tenet here is “tag once, use many times”. Going further, the primary tag should encourage as many downstream vendors as possible to work off the data in the data warehouse or lake, to minimize cost and governance issues.
Your primary tag drives governance and compliance. Remember the diffusion of responsibility problem described above: your primary tag needs to provide a single point of accountability and control for all downstream tools and processes, safeguarding consumers’ data privacy.
Your primary tag is owned by you, not rented. You need an ‘escape hatch’ to ensure that you, not your vendor, own your primary tag. That’s why Snowplow has been proudly open-source for 10 years.
Put all of these together, and the only true primary tag – the only Switzerland – is Snowplow. And in the next post in this manifesto, we’ll look at how a primary tag like Snowplow actually works…
[1] My favorite example of stickiness is the British Airways website, which still runs Tagman, the original tag manager, a full nine years after it was acquired by rival tag manager Ensighten. Ensighten in turn was purchased by CHEQ last year. A real turducken! But event tracking is sticky.
[2] We learnt this lesson the hard way in 2014 when the Snowplow integration for Segment was rejected (Segment upstream, Snowplow downstream). No hard feelings, Peter!
Hey Alex, I'm happy to have stumbled into your Substack today. I've been interested in what you're doing at Snowplow for some time now. I applaud you for writing down this founder's vision; I enjoyed reading it. But I still don't understand precisely what you mean by a "tag." I'm currently covering the data integration space for various companies and investors, and even though I read this twice, I'm still not grokking it. I get the notion of first-party data and the value of collecting it, but I'm still unsure I could articulate why you're different -- it has something to do with this notion of a "tag?" Unless I missed it, this post doesn't crisply define what a tag is.
Can you give me a concrete example of what a "tag" is, how it's used, and how it's the "primary key" to understanding customer behavior?
Couldn't wait till the weekend to read this and I gotta say, it resonates hard. Big ups to the Snowplow team for sticking to their source of inspiration and a problem they deeply understand.