Chapter 5 · Product Analyst
5. A/B testing & experimentation
Experimentation is how product teams establish causation instead of guessing from correlation. For a product analyst it is a core skill and a heavily tested interview topic.
5.1 Why randomization is everything
In an A/B test, users are randomly assigned to a control (the current experience) or a variant (the change). Randomization makes the two groups equivalent in expectation on everything except the change, including characteristics you never measured, so a difference in outcome larger than chance can be attributed to the change. Remove randomization and you have only correlation.
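As a rough illustration of random assignment, here is a minimal sketch that buckets users by hashing their ID together with an experiment name. The function name `assign_variant`, the experiment name `new-checkout`, and the choice of SHA-256 are assumptions made for the example, not something the chapter prescribes.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Deterministically map a user to an arm (hypothetical helper).

    Salting the hash with the experiment name keeps assignments
    independent across experiments, and the same user always lands
    in the same arm for a given experiment.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Sanity check: the split over many made-up user IDs should be close to even.
counts = {"control": 0, "variant": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "new-checkout")] += 1
print(counts)
```

Hashing rather than drawing a fresh random number on each visit is one common design choice: it needs no stored state, and a returning user keeps seeing the same experience.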
5.2 The lifecycle of a sound experiment
- Hypothesis. A specific, falsifiable prediction with a primary metric chosen in advance.
- Power analysis. Compute the sample size to detect the smallest effect worth caring about.
- Randomize & run. Assign users, run for full business cycles (whole weeks, so weekends are covered), and don't stop early on an interim result.
- Analyze. Compare the primary metric, compute significance and a confidence interval (CI), check guardrails.
- Decide. Ship, iterate, or kill, weighing statistical and practical significance together.
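The Analyze step can be sketched for a conversion-rate metric with a two-proportion z-test. This is one common choice, not the only valid analysis. The function name `two_proportion_test` and the conversion counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (absolute lift, two-sided p-value, CI for the lift)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the test, since the null says
    # both arms share one underlying rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error: used for the confidence interval.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, p_value, ci

# Made-up example: 10.0% vs 11.0% conversion, 10,000 users per arm.
diff, p_value, ci = two_proportion_test(1_000, 10_000, 1_100, 10_000)
print(f"lift={diff:.4f}  p={p_value:.4f}  CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The interval feeds the Decide step directly: a lift can be statistically significant while the low end of its CI is still too small to be worth shipping.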
5.3 The concepts interviewers probe
| Concept | Plain meaning | Common trap |
|---|---|---|
| Null hypothesis | 'No real difference' between control and variant | Forgetting it's the default you need evidence against |
| p-value | Chance of data at least this extreme if the null is true | Reading it as '95% chance B is better' |
| Significance (α) | False-positive rate you accept (commonly 5%) | Confusing it with effect size |
| Power (1−β) | Chance of catching a real effect (commonly 80%) | Reading an underpowered, inconclusive test as 'no effect' |
| Confidence interval | Plausible range for the true effect | Reporting p alone and dropping the interval |
| MDE | Smallest effect you can reliably detect | Setting it after seeing the data instead of before |
5.4 When you can't run a clean A/B test
Sometimes user-level randomization isn't possible: network effects, a feature everyone sees, a pricing change. Reach for quasi-experimental methods such as difference-in-differences, holdout groups, or staged geo rollouts. Naming one when asked 'what if you can't A/B?' sets you apart.
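To make difference-in-differences concrete, the arithmetic can be sketched as below. The function name and all four conversion rates are invented for illustration. The estimate is only credible under the parallel-trends assumption, meaning the treated and comparison groups would have moved together without the change.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical geo rollout: the treated region rises from 10.0% to 12.5%,
# while an untreated comparison region rises from 9.8% to 10.8%.
effect = diff_in_diff(
    treated_pre=0.100, treated_post=0.125,
    control_pre=0.098, control_post=0.108,
)
print(round(effect, 3))  # 0.015
```

Subtracting the comparison region's change removes whatever moved both regions at once (seasonality, a marketing push), leaving the part attributable to the rollout.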
5.5 Sample size, intuitively
| If you want to… | Sample size… |
|---|---|
| Detect a smaller effect | Increases (often dramatically) |
| Measure a noisier metric | Increases |
| Raise confidence (lower α) or power | Increases |
| Test on a higher base rate (same relative lift) | Decreases |
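The directions in the table above can be checked with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch: the function name `sample_size_per_arm`, the default α of 0.05 and power of 0.80, and the example rates are assumptions for illustration, and a real power analysis may use a different formula or a dedicated library.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(base_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided test on a rate."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)  # rate under the hoped-for lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

baseline = sample_size_per_arm(0.10, 0.10)             # 10% rate, +10% relative
smaller_effect = sample_size_per_arm(0.10, 0.05)       # halve the MDE
more_power = sample_size_per_arm(0.10, 0.10, power=0.90)
stricter_alpha = sample_size_per_arm(0.10, 0.10, alpha=0.01)
higher_base = sample_size_per_arm(0.20, 0.10)          # same relative lift

print(baseline, smaller_effect, more_power, stricter_alpha, higher_base)
```

Because the effect size is squared in the denominator, halving the MDE roughly quadruples the required sample, which is why 'often dramatically' is no exaggeration.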
5.6 Novelty and primacy effects
Early results can lie in two opposite ways. A novelty effect is a temporary lift because the change is new. A primacy effect is a temporary dip because users are jarred before they adapt. Both fade over time, so run full cycles and check whether the daily effect has stabilized before deciding.
5.7 Network interference
Standard A/B testing assumes one user's experience doesn't affect another's. That breaks for social, marketplace, and communication products, where a user in the variant can change what a user in control sees. Teams handle this with cluster-randomized designs: randomize whole communities or geographies rather than individuals.
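A cluster-randomized assignment can be sketched by hashing the cluster ID instead of the user ID, so everyone in a community shares an arm. The names below (`assign_cluster`, the `city-…` cluster IDs, the experiment name `group-chat-v2`) are hypothetical.

```python
import hashlib

def assign_cluster(cluster_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Assign a whole cluster (city, community, market) to one arm."""
    key = f"{experiment}:{cluster_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Made-up mapping of users to the cluster they belong to.
user_cluster = {
    "alice": "city-berlin",
    "bob": "city-berlin",
    "chen": "city-osaka",
}

# Each user inherits the arm of their cluster, never an individual draw.
assignment = {
    user: assign_cluster(cluster, "group-chat-v2")
    for user, cluster in user_cluster.items()
}
print(assignment)
```

The trade-off is statistical: the unit of randomization is now the cluster, so the analysis should treat clusters, not users, as the independent observations, and the test typically needs more traffic for the same power.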
5.8 Failure modes: a reference
| Pitfall | What goes wrong | The fix |
|---|---|---|
| Peeking | Stopping early at significance inflates false positives | Fix sample size up front |
| Underpowered test | Too few users to detect the real effect | Power analysis before launch |
| Ignoring guardrails | Primary metric wins, key metric quietly breaks | Pre-define guardrails |
| Novelty effect | Early lift fades as novelty wears off | Run full cycles; watch trend |
| Network interference | Variant affects control via the network | Cluster-randomize |
| Multiple comparisons | Testing many metrics finds false winners | Correct α (e.g. Bonferroni) |
| Sample ratio mismatch | Observed split deviates from the planned one (e.g. 50/50), a sign of broken assignment or logging | Check the ratio before reading any metric |
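The sample ratio mismatch check in the last row can be sketched as a chi-square goodness-of-fit test on the two arm counts. The function name `srm_p_value` and the counts are made up, and the 0.001 alert threshold is an assumption for the example (teams choose their own).

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """p-value for 'the observed split matches the planned split'."""
    total = n_control + n_variant
    exp_control = total * expected_control_share
    exp_variant = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    # With two arms there is 1 degree of freedom, so the chi-square
    # tail probability reduces to erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001  # assumed alert level, not a universal standard

for control, variant in [(50_000, 50_300), (50_000, 51_500)]:
    p = srm_p_value(control, variant)
    status = "possible SRM, investigate" if p < SRM_THRESHOLD else "ok"
    print(control, variant, f"p={p:.6f}", status)
```

A gap of 300 users in 100,000 is ordinary noise, while a gap of 1,500 is far too unlikely under a true 50/50 split, so the metrics from that second test should not be trusted until the assignment bug is found.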
