A/B Testing

A/B testing is a controlled experiment that randomly splits traffic between variants to measure which version produces a statistically significant lift on a defined metric. Also called split testing, it works across webpages, emails, ads, or product flows, with metrics like signup, purchase, click, or retention. It is the core experimental method underneath conversion rate optimization and most modern growth marketing.
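To make "statistically significant lift" concrete, here is a minimal sketch of one common way to compare two conversion rates: a two-sided, pooled two-proportion z-test. The function name and the visitor and conversion counts are made up for illustration, and real testing platforms may use different methods (sequential or Bayesian, for example).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (relative_lift, p_value). Uses the pooled normal
    approximation, which assumes each variant has plenty of conversions.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = (rate_b - rate_a) / rate_a
    return relative_lift, p_value


# Hypothetical counts, for illustration only
lift, p = two_proportion_z_test(conv_a=1500, n_a=30000, conv_b=1650, n_b=30000)
print(f"relative lift: {lift:.1%}, p-value: {p:.4f}")
```

A p-value below the pre-chosen significance level (commonly 0.05) is what "statistically significant" refers to here, and it only means what it claims if the sample size and metric were fixed before the test started.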

A valid A/B test requires three things: random assignment (users land in variant A or B by chance, not by characteristic), a pre-defined primary metric (decided before the test starts, not picked after), and adequate sample size (calculated before launch from the baseline conversion rate, the minimum detectable effect, and the desired statistical power, typically 80 percent at a 5 percent significance level). As a rule of thumb, detecting a 10 percent relative lift on a 5 percent baseline conversion rate (5.0 percent to 5.5 percent) takes roughly 30,000 visitors per variant. Required sample grows with the inverse square of the effect size, so halving the detectable lift roughly quadruples the sample.

Industry data from Optimizely, VWO, and Eppo suggests roughly 1 in 7 properly powered A/B tests delivers a real lift; the rest are flat or negative. Multi-variant tests (A/B/n or full factorial) divide traffic across more variants and need proportionally more total sample. The two most common ways founders break their tests are peeking (stopping early when results look good) and ad-hoc segmentation (slicing the results after the fact until a "winning" segment emerges), both of which inflate the false-positive rate well above the stated 5 percent.
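The sample-size calculation described above can be sketched with the textbook normal-approximation formula for comparing two proportions. This is an assumption about method, not the formula any particular vendor uses; commercial calculators differ slightly in their corrections, so expect figures in the same neighborhood rather than identical ones. The function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in EACH variant for a two-sided two-proportion test.

    baseline:      control conversion rate, e.g. 0.05 for 5 percent
    relative_lift: minimum detectable effect, relative, e.g. 0.10 for +10 percent
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80 percent power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_10 = sample_size_per_variant(baseline=0.05, relative_lift=0.10)
n_05 = sample_size_per_variant(baseline=0.05, relative_lift=0.05)
print(f"10% lift: {n_10:,} per variant")
print(f" 5% lift: {n_05:,} per variant ({n_05 / n_10:.1f}x as many)")
```

With a 5 percent baseline and a 10 percent relative lift, this lands close to the rule-of-thumb figure of roughly 30,000 per variant, and halving the lift pushes the requirement up by close to a factor of four.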

Ryan's Take

Most startup A/B tests are not A/B tests. They are A/B-shaped vibes. The team ships two versions, waits a week, eyeballs the numbers, declares a winner, and moves on. Then the "winning" change does not move the metric in production and everyone is mystified. The honest test requires you to commit, in writing, to the sample size and the metric before you start, and to actually wait. That feels slow. It is slow. It is also the only way to learn something real. The alternative is a culture of theatrical experimentation that produces zero compounding knowledge.

What founders get wrong: Running A/B tests at traffic volumes where the required sample size is out of reach. If you cannot reach the calculated sample size in a reasonable timeframe (usually 2 to 4 weeks), the test is underpowered: it cannot resolve, and whatever difference shows up is mostly noise. At low traffic, do qualitative research and message testing instead. A/B testing is a tool for medium-to-high traffic surfaces.
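A quick feasibility check before launching can be sketched as follows. The helper names, the 28-day ceiling (the upper end of the 2 to 4 week window above), and the traffic figures are all hypothetical; the required sample per variant is taken as an input from whatever calculator you use.

```python
from math import ceil


def days_to_reach_sample(sample_per_variant, daily_visitors, variants=2):
    """Days needed if daily traffic is split evenly across the variants."""
    visitors_per_variant_per_day = daily_visitors / variants
    return ceil(sample_per_variant / visitors_per_variant_per_day)


def is_feasible(sample_per_variant, daily_visitors, variants=2, max_days=28):
    """True if the test can reach its sample size within max_days."""
    return days_to_reach_sample(sample_per_variant, daily_visitors, variants) <= max_days


# Hypothetical traffic levels against a 30,000-per-variant requirement
for daily in (500, 5000):
    days = days_to_reach_sample(30000, daily)
    verdict = "feasible" if is_feasible(30000, daily) else "too slow, skip the A/B test"
    print(f"{daily:>5} visitors/day -> {days} days ({verdict})")
```

At 500 visitors a day the test would need about four months, which is the situation where qualitative research is the better use of time.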

FAQ

What is A/B testing?
A controlled experiment that randomly splits traffic, users, or recipients between two or more variants of a webpage, email, ad, or product flow to measure which version produces a statistically significant lift on a defined metric. The core experimental method of conversion rate optimization.

How long should an A/B test run?
Long enough to reach the pre-calculated sample size at 80% statistical power and 5% significance, typically 2 to 4 weeks for medium-traffic pages. Running past the calculated point is fine; stopping before it is not. Tests with traffic too low to reach that sample size cannot resolve.

Why do most A/B tests fail to replicate?
Two main reasons: peeking at results midway and stopping early (which inflates false positives well above 5%), and ad-hoc segmentation of results after the test (which mines noise for "wins"). Both are easy to fall into and both produce changes that do not move the metric in production.
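The peeking effect can be illustrated with a small simulation of A/A tests, where there is no real difference, so every "win" is a false positive. This is a simplified sketch, not a model of any specific tool: it assumes equal-sized batches of traffic between looks and works directly with z-scores under the normal approximation instead of simulating individual visitors. The function name, number of looks, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(looks=10, simulations=2000, alpha=0.05, seed=42):
    """Compare 'stop at the first significant look' with 'test once at the end'.

    Under the null, each equal-sized batch contributes an independent
    standard-normal increment, so the z-score at look k is the running
    sum divided by sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_only_hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        z = 0.0
        crossed_at_any_look = False
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(k)
            if abs(z) > z_crit:
                # A peeker would stop here and declare a winner
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        final_only_hits += abs(z) > z_crit  # z is now the final-look score
    return peeking_hits / simulations, final_only_hits / simulations


peeking, final_only = false_positive_rates()
print(f"false positives with peeking:    {peeking:.1%}")
print(f"false positives, final look only: {final_only:.1%}")
```

The final-look-only rate should sit near the nominal 5 percent, while the peeking rate comes out noticeably higher, and it keeps climbing as the number of looks increases.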
