Statistical foundations for causal inference

**Lukas Vermeer** is **Director** of **Experimentation** at **Booking.com**

Umbrellas often appear just before it pours, but banning them will not stop the rain; it will just make everyone wetter

- We cannot sample the entire population
- We cannot expose the same unit to both treatments at once
- We cannot directly observe underlying probabilities

- Is there any causal effect?
- What is the size of the causal effect?

(If the size is non-zero, there is an effect.)

If we show A and B to random samples of the population, then on average the fraction of "yes" in each group will equal that group's true underlying mean, and on average across replications of the experiment the difference between the group means will equal the average treatment effect
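This unbiasedness claim can be checked with a quick simulation; the conversion rates, sample size, and replication count below are illustrative assumptions, not numbers from the talk:

```python
import random

random.seed(42)

def simulate_ab(p_a, p_b, n):
    """One experiment: n units per group, Bernoulli 'yes'/'no' outcomes."""
    yes_a = sum(random.random() < p_a for _ in range(n))
    yes_b = sum(random.random() < p_b for _ in range(n))
    return yes_b / n - yes_a / n  # observed difference in sample means

# Assumed true rates: 10% for A, 15% for B, so the true effect is 0.05.
diffs = [simulate_ab(0.10, 0.15, 1000) for _ in range(2000)]
avg_diff = sum(diffs) / len(diffs)
# Any single experiment is noisy, but the average of the observed
# differences across replications lands close to the true effect of 0.05.
```

Individual replications scatter around the truth; randomization guarantees only that the estimator is unbiased on average, which is exactly why a single experiment still needs a significance test.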

- Causation resulted in people behaving differently when treatment was applied
- Pure chance resulted in a difference between the two groups unrelated to the treatment
- Mistakes resulted in an unintended difference in results unrelated to the treatment

⚅⚅⚅

Not fair; I cheated.

You did not need to know what to expect from a loaded die. Instead, you simply rejected the idea that it was fair.

The null hypothesis assumes there is no treatment effect for any unit; any difference we observe is simply due to chance

If we could reasonably rule out mistakes and chance, we might reject the null and consider this to be evidence for an alternative

Assuming there is no effect, the p-value is the probability of seeing a result at least as extreme as the one observed, purely by chance.

**How likely is this result assuming the null is true?**

(The probability of rolling three sixes in a row with a fair die is (1/6)³ = 1/216 ≈ 0.00463.)

One swallow does not a summer make, nor one fine day, but how many swallows do we count before we pack away our umbrellas?

**Scientific standard for significance:** p < 0.05

- Type-I is the incorrect rejection of a true null hypothesis; we cried foul when there was none
- Type-II is the failure to reject a false null hypothesis; we failed to detect a real effect

Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true

**Two main things affect statistical power:**

- Sample size (more is better)
- Effect size (more is better)
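Both levers can be seen in a small simulation: generate experiments with a known true effect, run a standard two-sample z-test on each, and count how often the null is rejected. The rates, sample sizes, and test choice below are illustrative assumptions:

```python
import random

random.seed(0)

def reject_null(p_a, p_b, n, z=1.96):
    """One two-sample proportion z-test at alpha = 0.05 (two-sided)."""
    a = sum(random.random() < p_a for _ in range(n)) / n
    b = sum(random.random() < p_b for _ in range(n)) / n
    pooled = (a + b) / 2
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return se > 0 and abs(b - a) / se > z

def power(p_a, p_b, n, reps=1000):
    """Fraction of simulated experiments in which the true effect is detected."""
    return sum(reject_null(p_a, p_b, n) for _ in range(reps)) / reps

low_n  = power(0.10, 0.13, 500)    # small sample, small effect
high_n = power(0.10, 0.13, 2000)   # larger sample, same effect
big_fx = power(0.10, 0.16, 500)    # same sample, larger effect
# Both high_n and big_fx come out well above low_n.
```

The same logic run in reverse is a power calculation: fix the smallest effect worth detecting and the desired power, then solve for the sample size before starting the experiment.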

The methods described assume strict adherence to protocol; violations of protocol such as peeking and multiple testing increase the type-I error rate

Fair die; I still cheated.

(If you keep trying, the probability of eventually rolling three sixes in a row with a fair die approaches 1.)
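The same inflation happens with peeking: testing the data after every batch and stopping at the first significant result raises the false-positive rate far above the nominal 5%, even when the null is true. A sketch of that effect, with illustrative batch sizes and peek counts:

```python
import random

random.seed(1)

def peeking_false_positive(p=0.5, batch=200, looks=10, z=1.96):
    """Null is true (both groups share rate p); peek after every batch
    and stop as soon as a z-test crosses the nominal 5% threshold."""
    yes_a = yes_b = n = 0
    for _ in range(looks):
        yes_a += sum(random.random() < p for _ in range(batch))
        yes_b += sum(random.random() < p for _ in range(batch))
        n += batch
        a, b = yes_a / n, yes_b / n
        pooled = (a + b) / 2
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and abs(b - a) / se > z:
            return True  # "significant" despite there being no effect
    return False

rate = sum(peeking_false_positive() for _ in range(1000)) / 1000
# With ten peeks, the realized type-I error rate lands well above 0.05.
```

Sequential testing procedures (as in [4]) recover flexibility by adjusting the thresholds at each look instead of reusing the fixed single-look threshold.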

More flexible protocols may be desirable

- early stopping rules to mitigate damage
- early shipping to minimize opportunity cost
- multiple variants to test several alternatives
- multiple metrics to guard business KPIs

All of these are possible [4], but they require protocol adjustments

[4] Deng, Alex, Tianxi Li, and Yu Guo. 2014. "Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation." WWW '14: 609–618.

- Rubin, Donald B. 1974. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology 66 (5): 688–701. (link)
- Goodman, Steven N. 2008. "A Dirty Dozen: Twelve P-Value Misconceptions." Seminars in Hematology 45: 135–140. (link)
- Kohavi, Ron, Roger Longbotham, Dan Sommerfield, et al. 2009. "Controlled Experiments on the Web: Survey and Practical Guide." Data Mining and Knowledge Discovery 18: 140. (link)
- Deng, Alex, Tianxi Li, and Yu Guo. 2014. "Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation." WWW '14: 609–618. (link)