A/B tests lie 🤥 & statistical significance is a myth 🦄
I was, and still am, a stats nerd 🤓. But who else got fooled by the “confidence” of 95% confidence like I did?
🎵 I had a manager once who liked to say that I could “make the numbers sing.” At the time that sounded like a good thing. 🎵
The problem is, like most buskers, the numbers will sometimes just sing 🎙️ what you want to hear. Not what you need to hear 🙊 🙉.
Since before my days in business school and my sophomore-year stats class, I’ve been a numbers and analytics person. There’s something tidy and comforting about statistics.
I revel in the unrelenting change that comes with startups. The only certainty is uncertainty. But you can always remain certain that the numbers will show the way in their unprejudiced, unbiased truth.
🔔 Bell curves & normal distributions
🐣 Regression analysis
𝞼 Standard deviation
🧐 R² & p-value
☕ t-tests
Or so I thought.
I’ve been surprised over the years.
📈 Tests show improvement;
📉 Then, upon rollout, reality shows decline
What happened?! 😖
Lots can go wrong in A/B testing:
👀 Looking at the wrong business outcomes
👃 Not ensuring tests pass the sniff test
⏲ Focusing only on short-term signals
🍤 Changes that are insignificant
🎲 Improper randomization
🛠️ Tooling problems
➕ and more...
🚨 Looking at the wrong business outcomes 🚨
Often what you’re monitoring is not really the outcome. For instance, you might drive more clicks, but fewer conversions. I’ve seen it so many times: I increased the click-through rate (CTR%) on an email, but crippled the sign-ups on the landing page. What’s your real target outcome to optimize?
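Here’s a toy version of that email example. All the numbers are made up; the point is just how the metric you watch can move the opposite way from the one you actually care about:

```python
# Hypothetical numbers -- the metric you watch (CTR) can move in the
# opposite direction from the one that pays the bills (sign-ups).
emails_sent = 10_000

# Variant A: modest CTR, strong landing page
a_clicks  = emails_sent * 0.02   # 2.0% CTR  -> 200 clicks
a_signups = a_clicks * 0.20      # 20% landing-page conversion -> 40 sign-ups

# Variant B: clickbait subject line -- CTR jumps, landing page tanks
b_clicks  = emails_sent * 0.03   # 3.0% CTR  -> 300 clicks
b_signups = b_clicks * 0.10      # 10% landing-page conversion -> 30 sign-ups

print(f"A: CTR {a_clicks / emails_sent:.1%}, sign-ups {a_signups:.0f}")
print(f"B: CTR {b_clicks / emails_sent:.1%}, sign-ups {b_signups:.0f}")
# B "wins" on CTR but loses 25% of the sign-ups.
```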
🚨 Not ensuring tests pass the sniff test 🚨
You should understand WHY things work, or at least have an idea. Too often, I’ve seen that when things look too good to be true, they are. And this goes for A/B testing. Does the outcome not make sense? Maybe you should re-run the test to confirm it.
🚨 Focusing only on short-term signals 🚨
Sure, you got more sign-ups. But is the quality as good? Not all sign-ups are the same. Maybe you got 50% more purchases of your SaaS product...but what if you materially changed who responded, and they’re only going to stay around for 3 months instead of 3 years?
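A quick back-of-the-envelope sketch (every number below is invented) of how a “winning” variant can destroy value once retention is counted:

```python
# Hypothetical cohorts: more sign-ups, but a worse audience.
monthly_price = 100  # $/month, made up

# Control: 100 sign-ups that stick around ~3 years
control_signups = 100
control_months  = 36
control_value   = control_signups * control_months * monthly_price  # $360,000

# Variant: 50% more sign-ups that churn after ~3 months
variant_signups = 150
variant_months  = 3
variant_value   = variant_signups * variant_months * monthly_price  # $45,000

print(f"Control cohort value: ${control_value:,}")
print(f"Variant cohort value: ${variant_value:,}")
# The "winning" variant is worth roughly 1/8th as much once retention is counted.
```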
🚨 Changes that are insignificant 🚨
I don’t have infinite time, and we’re not an ecommerce site with millions of visitors and thousands of purchases per day. I’m not looking for an incremental improvement of half a percent. I need to look for tests that are at least a double-digit improvement. Unless you’re a huge ecommerce company, the same probably goes for you. Plus, the smaller the change, the larger the sample size you need to be sure.
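If you want to see why, here’s a rough sample-size sketch using statsmodels’ power calculations. The 5% baseline conversion rate, 80% power, and 95% confidence are just illustrative assumptions:

```python
# Rough sample-size sketch: visitors needed per arm to detect a given
# relative lift at 80% power and 95% confidence. Baseline rate is a made-up 5%.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
power_analysis = NormalIndPower()

for lift in (0.005, 0.10, 0.20, 1.00):  # 0.5%, 10%, 20%, and 2x relative lift
    variant = baseline * (1 + lift)
    effect = proportion_effectsize(baseline, variant)
    n = power_analysis.solve_power(effect_size=effect, alpha=0.05,
                                   power=0.8, alternative='two-sided')
    print(f"{lift:>6.1%} lift -> ~{n:,.0f} visitors per arm")

# Ballpark: a 0.5% lift needs millions of visitors per arm,
# a 20% lift needs a few thousand, a 2x lift needs a few hundred.
```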
🚨 Improper randomization 🚨
Are you sure you’re getting a true representative sample and an honest A/B split? For instance, a lot of software developers use ad-blocking extensions that will keep them out of your A/B tests. So your tests are systematically missing part of your audience that may have different behavior.
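One cheap guardrail here is a sample ratio mismatch (SRM) check: if you asked the tool for a 50/50 split, the observed assignment counts should actually look 50/50. A sketch with scipy, using hypothetical counts:

```python
# Sample ratio mismatch (SRM) check: did the 50/50 split you asked for
# actually come out 50/50? The counts below are hypothetical.
from scipy.stats import chisquare

visitors_a, visitors_b = 10_321, 9_428    # observed assignment counts
total = visitors_a + visitors_b
expected = [total / 2, total / 2]         # what a true 50/50 split would give

stat, p_value = chisquare([visitors_a, visitors_b], f_exp=expected)
print(f"SRM p-value: {p_value:.2g}")

# A tiny p-value (say < 0.001) means the split itself is broken -- ad blockers,
# bot filtering, a buggy bucketing function -- and the conversion results
# can't be trusted, no matter how "significant" they look.
```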
🚨 Tooling problems 🚨
Tools aren’t perfect, and sometimes they just don’t randomize well. Or they mix users across test groups. Or they don’t track all the conversions. Or they over-count conversions. Should I continue? Because I could for a long time.
So, how can you make sure you’re running a sound testing program?
👥 A/A tests - QA your tools
🧑⚖️ Good judgement for tests and signals
💰 Understand your real business outcomes
☄ Test things that are big enough to have real impact
👍 A/A tests - QA your tools 👍
Just because your tool reports >95% confidence doesn’t mean it’s actually true. We periodically run A/A tests...and you’d be surprised how often one of the “variants” will win with >95% confidence. Now we tend to include a variant that is identical to the control...and the test is NOT called until that variant’s outcome matches the control’s.
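If you want to convince yourself, here’s a small simulation (not our actual tooling, just a sketch with made-up traffic and conversion rates) of A/A tests where you peek at the results after every batch of visitors and stop at the first “significant” reading:

```python
# Simulate A/A tests: both arms share the exact same 5% conversion rate.
# We "peek" after every batch and call a winner the first time the z-test
# crosses 95% confidence. All rates and sizes are invented.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
true_rate, batch, n_peeks, n_tests = 0.05, 500, 20, 1000
false_winners = 0

for _ in range(n_tests):
    a = rng.binomial(1, true_rate, batch * n_peeks)
    b = rng.binomial(1, true_rate, batch * n_peeks)
    for peek in range(1, n_peeks + 1):
        n = peek * batch
        _, p = proportions_ztest([a[:n].sum(), b[:n].sum()], [n, n])
        if p < 0.05:              # "winner" at >95% confidence
            false_winners += 1
            break

print(f"A/A tests that 'found' a winner: {false_winners / n_tests:.0%}")
# With 20 peeks, expect well above the advertised 5% false positive rate.
```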
👍 Good judgement for tests and signals 👍
Again, is it too good to be true? It probably is. Think about what you’re testing. Can you look at the quality of the conversions, for instance? Are you trading 10 Enterprise conversions for 15 SMB accounts? Are you getting a bunch of random, sketchy emails whereas before you were getting a few solid, real business emails?
👍 Understand your real business outcomes 👍
Make sure you’re testing what matters. Who cares about CTR% on an email if what you want is sign-ups? Clicks on an AdWords ad only matter if they convert on the page. Maybe you’re getting more conversions, but not all conversions are the same. I’ll usually take 10 demo requests over 100 ebook downloads.
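One way to make that concrete: score variants on value-weighted conversions instead of raw counts. The per-conversion dollar values below are pure assumptions:

```python
# Value-weighted conversions: weight each conversion type by a rough
# dollar value instead of counting them all equally. Values are assumptions.
conversion_value = {"demo_request": 500, "ebook_download": 5}  # $ each, made up

variant_a = {"demo_request": 10, "ebook_download": 20}
variant_b = {"demo_request": 2,  "ebook_download": 100}

def weighted_value(counts):
    """Total assumed value of a variant's conversions."""
    return sum(conversion_value[kind] * n for kind, n in counts.items())

print(f"A: {sum(variant_a.values())} conversions worth ${weighted_value(variant_a):,}")
print(f"B: {sum(variant_b.values())} conversions worth ${weighted_value(variant_b):,}")
# B "wins" on raw conversions (102 vs 30), but A wins where it counts.
```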
👍 Test things that are big enough to have real impact 👍
I’ve told my team that if it’s not a 20% lift, I don’t care. I want people thinking about 2x or more. If that’s our goal, we can call tests faster, we can test more, and we can have way more impact. Plus, we can be more certain of our outcomes.
But these are all just my learnings from doing this for a while. How do you ensure good testing?