What You Will Learn
- Why email audiences are unique and require their own testing rather than adopting generic best practices
- The variables to test in priority order — from highest to lowest leverage
- How to set up a valid A/B test (not just send two versions)
- Minimum sample sizes for statistically meaningful results
- How to determine if a result is statistically significant or just noise
- When multivariate testing is appropriate and when it over-complicates
- How to build a systematic testing programme that compounds learnings
Why Test — Not Just Apply Best Practices
Email best practices — the guidelines in this guide and every other email marketing resource — are derived from aggregate data across thousands of senders and millions of subscribers. They represent average behaviour, not your specific audience's behaviour. Your audience may be different: older or younger, more professional or more casual, more price-sensitive or more brand-loyal, more responsive to humour or more responsive to data.
A/B testing tells you what works for your audience specifically. "Best practice says first-name personalisation in subject lines improves open rate" — does it for your subscribers? Test it. "Best practice says shorter emails get higher CTR" — does this hold for your audience? Test it. Only your test data tells you with confidence.
What to Test — Priority Order
| Variable | Metric Affected | Priority | Example |
|---|---|---|---|
| Subject line | Open rate | Highest | Curiosity gap vs specific benefit; question vs statement |
| CTA button text | Click rate | Very high | "Shop Now" vs "Get Yours" vs "See the Collection" |
| Email length | Click rate, conversion rate | High | Short (300 words) vs long (900 words) |
| Primary offer / hero content | Conversion rate | High | Product vs lifestyle hero image; discount vs free shipping offer |
| Send time | Open rate | Medium | Tuesday 9am vs Thursday 2pm |
| From name | Open rate, trust | Medium | "Digital Codex" vs "James at Digital Codex" |
| Preview text | Open rate | Medium | Benefit-focused vs curiosity-gap preview |
| Body copy tone | Click rate | Lower | Professional vs conversational |
| Image vs no image | Click rate, load time | Lower | HTML vs plain text |
Test in priority order — subject lines and CTAs have the highest leverage because they affect the core conversion funnel (open → click). Optimise these before testing lower-leverage variables like button colour.
Setting Up a Valid Test
A valid A/B test changes exactly one variable. All other elements must be identical between Version A and Version B — same send time, same audience segment, same campaign objective. If you change both the subject line and the CTA simultaneously, you cannot determine which change drove the performance difference.
Valid test setup checklist
- One variable changed: ✅
- Audience randomly split (not by segment): ✅
- Both versions sent simultaneously (not on different days): ✅
- Equal send volume to each version: ✅
- Sufficient sample size per version: ✅
- Win metric defined before sending (not selected post-hoc): ✅
Most ESP A/B test features handle the random split and simultaneous sending automatically; what remains in your hands is the choice of variable to test and the win metric.
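If your ESP lacks a built-in A/B feature and you must run the split yourself, the essential property is random assignment rather than a split by segment, signup date or alphabetical order. A minimal sketch in Python, assuming subscriber IDs in a flat list (the function name is illustrative):

```python
import random

def random_split(subscriber_ids, seed=42):
    """Randomly assign subscribers to versions A and B.

    Shuffling first gives every subscriber an equal chance of landing
    in either version; splitting by segment or signup order would
    bias the comparison.
    """
    ids = list(subscriber_ids)
    random.Random(seed).shuffle(ids)  # fixed seed makes the split reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]     # (version A, version B)

version_a, version_b = random_split(range(10_000))
```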
Sample Size and Test Duration
Too small a sample produces noise — Version B may appear to win simply because a few extra people opened during the test window, not because it is genuinely better. Minimum practical thresholds:
| What You Are Testing | Minimum Events per Variation | Why |
|---|---|---|
| Subject line (open rate) | 200+ opens per variation | Opens are relatively frequent; smaller sample viable |
| CTA / body (click rate) | 100+ clicks per variation | Clicks are less frequent; requires larger total send |
| Conversion (purchase/lead) | 50+ conversions per variation | Conversions are least frequent; very large sends required |
Translating to list size: if your average open rate is 25% and click rate is 3%, a subject-line test (needing 200 opens per variation) requires 800 sends per variation, while a CTA test (needing 100 clicks per variation) requires roughly 3,334 sends per variation (100 clicks ÷ 3% CTR), about 6,700 in total. Small lists may not have sufficient volume for valid click- or conversion-level testing.
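The same arithmetic generalises to any metric. A small helper, assuming you know your historical rate for the event being tested (the function and parameter names are illustrative):

```python
import math

def required_sends(min_events, event_rate, variations=2):
    """Estimate the send volume needed to reach a minimum event count.

    min_events -- minimum events per variation (from the table above)
    event_rate -- your historical rate for that event, e.g. 0.03 for 3% CTR
    variations -- 2 for an A/B test; more for multivariate
    """
    per_variation = math.ceil(min_events / event_rate)
    return per_variation, per_variation * variations

# CTA test at a 3% click rate: 3,334 per variation, 6,668 in total
print(required_sends(min_events=100, event_rate=0.03))
# Subject-line test at a 25% open rate: 800 per variation, 1,600 in total
print(required_sends(min_events=200, event_rate=0.25))
```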
Test duration
Run tests for a minimum of 4 hours, and ideally a full 24 hours to capture your audience's time-of-day patterns. Do not evaluate results before the test window closes: early leaders frequently reverse as more data arrives.
Statistical Significance
Statistical significance measures how unlikely it is that a test result arose from random chance alone. At a 95% confidence level, a difference as large as the one observed would occur by random variation only 5% of the time if the two versions in fact performed identically.
Most ESP A/B test features calculate this automatically and declare a winner. If yours does not, use one of the many free online A/B significance calculators: input the number of sends and the number of opens or clicks for each version.
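The underlying calculation is a standard two-proportion z-test, which is what most of those calculators run. A minimal sketch in Python, taking the same inputs a calculator asks for (the function name is illustrative):

```python
from math import erf, sqrt

def ab_significance(sends_a, events_a, sends_b, events_b):
    """Two-proportion z-test; events are opens or clicks per version."""
    p_a, p_b = events_a / sends_a, events_b / sends_b
    pooled = (events_a + events_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return round((1 - p_value) * 100, 1)  # confidence level, in percent
```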
Interpreting results
- 95%+ confidence: Result is statistically significant — implement the winner
- 85–95% confidence: Directional signal — the winner is probably better, but test again with a larger sample before treating as definitive
- Below 85% confidence: Inconclusive — do not implement based on this test; consider the variables equivalent until a clearer result emerges
A common mistake is declaring a winner from the percentage difference alone: "Version B had a 28% open rate vs Version A's 25%, a 12% relative lift, so Version B wins." Without significance testing, that 3-point difference may be random noise: a 3-point gap on 200 sends is far less reliable than the same gap on 20,000 sends.
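Running that example through the sketch above makes the point concrete:

```python
# 25% vs 28% open rate on 200 sends per version: inconclusive
print(ab_significance(200, 50, 200, 56))              # ~50% confidence
# The same 3-point gap on 20,000 sends per version: decisive
print(ab_significance(20_000, 5_000, 20_000, 5_600))  # >99.9% confidence
```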
Multivariate Testing
Multivariate testing varies several variables simultaneously across multiple versions: for example, the 4 combinations of 2 subject lines × 2 CTAs. It can find the optimal combination faster than a sequence of A/B tests.
The requirement: very large lists. A multivariate test with 4 variations splits your audience four ways, so it needs roughly double the total send volume of a standard A/B test to give each variation the same sample size, and correcting for the extra comparisons pushes the requirement higher still. For most email senders with lists under 50,000, sequential A/B tests are more practical and produce more reliable results than multivariate testing.
Multivariate testing is most appropriate for: landing pages (more traffic available); very large email lists (100,000+) with high-volume sends; and advanced email platforms with built-in multivariate support.
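The volume requirement follows directly from the `required_sends` estimator sketched earlier: each variation needs the same per-variation sample, so total send volume scales with the number of combinations.

```python
# 2 subject lines x 2 CTAs = 4 combinations, judged on click rate
print(required_sends(min_events=100, event_rate=0.03, variations=4))
# -> (3334, 13336): double the total send of the two-version CTA test
```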
Building a Testing Programme
Individual tests answer individual questions. A testing programme compounds learnings over time — each test builds on the previous one, and patterns across tests reveal what your audience systematically responds to.
- Test on a regular cadence. Aim for at least one A/B test per month — not every send, but regularly enough that you accumulate knowledge over time.
- Maintain a test log. Record: what was tested, date, winner, confidence level, performance difference, and action taken (a minimal log structure is sketched after this list). A log of 20+ tests starts showing clear patterns about your audience's preferences.
- Prioritise tests by potential impact. Use the priority order above — test subject lines before button colours. The highest-leverage variables yield more valuable insights.
- Test the same variable multiple times. One subject line test shows which of two specific subject lines won. Ten subject line tests start showing which types of subject lines (curiosity vs specificity vs urgency) consistently win for your audience — a much more valuable insight.
- Implement winners promptly. The point of testing is applying learnings. A test result implemented in next week's campaign has immediate value; a test result filed and forgotten has none.
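A test log needs nothing more sophisticated than a spreadsheet or a CSV file. A minimal sketch, with an illustrative column set and sample values you should adapt to your own programme:

```python
import csv
import os

LOG_FIELDS = ["date", "variable_tested", "version_a", "version_b",
              "winner", "confidence_pct", "relative_lift_pct", "action_taken"]

def log_test(path, result):
    """Append one test result (a dict keyed by LOG_FIELDS) to a CSV log."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(result)

# Hypothetical entry; every value here is illustrative
log_test("test_log.csv", {
    "date": "2025-03-04", "variable_tested": "subject line",
    "version_a": "curiosity gap", "version_b": "specific benefit",
    "winner": "B", "confidence_pct": 96, "relative_lift_pct": 9,
    "action_taken": "benefit-led subjects in next month's campaigns",
})
```

Once the log accumulates, grouping rows by variable tested and by subject-line type is how the cross-test patterns described above become visible.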