Statistical foundations for causal inference

**Lukas Vermeer** is **Director** of **Experimentation** at **Booking.com**

Umbrellas often appear just before it pours, but banning them will not stop the rain; it will just make everyone wetter

- We cannot sample the entire population
- We cannot expose the same unit to both treatments at once
- We cannot directly observe underlying probabilities

- Is there any causal effect?
- What is the size of the causal effect?

(If the size is non-zero, there is an effect.)

If we show A and B to random samples of the population, then on average the fraction of "yes" in each group will equal that group's true underlying mean, and on average across replications of the experiment the difference between the group means will equal the average treatment effect
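This unbiasedness claim can be checked with a quick simulation; the conversion rates, sample size, and replication count below are illustrative assumptions, not numbers from the talk:

```python
import random

random.seed(42)

def simulate_ab(p_a, p_b, n):
    """One experiment: n units per group, Bernoulli 'yes'/'no' outcomes."""
    yes_a = sum(random.random() < p_a for _ in range(n))
    yes_b = sum(random.random() < p_b for _ in range(n))
    return yes_b / n - yes_a / n  # observed difference in sample means

# Assumed true rates: 10% for A, 15% for B, so the true effect is 0.05.
diffs = [simulate_ab(0.10, 0.15, 1000) for _ in range(2000)]
avg_diff = sum(diffs) / len(diffs)
# Any single experiment is noisy, but the average of the observed
# differences across replications lands close to the true effect of 0.05.
```

Individual replications scatter around the truth; randomization guarantees only that the estimator is unbiased on average, which is exactly why a single experiment still needs a significance test.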

- Causation resulted in people behaving differently when treatment was applied
- Pure chance resulted in a difference between the two groups unrelated to the treatment
- Mistakes resulted in an unintended difference in results unrelated to the treatment

⚅⚅⚅

Not fair; I cheated.

You did not need to know what to expect from a loaded die. Instead, you simply rejected the idea that it was fair.

The null hypothesis assumes there is no treatment effect for any unit; any difference we observe is simply due to chance

If we could reasonably rule out mistakes and chance, we might reject the null and consider this to be evidence for an alternative

Assuming there is no effect, the p-value is the probability of seeing a result at least as extreme as the one observed, purely by chance.

**How likely is this result assuming the null is true?**

(The probability of rolling three sixes in a row with a fair die is (1/6)³ = 1/216 ≈ 0.00463.)

One swallow does not a summer make, nor one fine day, but how many swallows do we count before we pack away our umbrellas?

**Scientific standard for significance:** p < 0.05

- Type-I is the incorrect rejection of a true null hypothesis; we cried foul when there was none
- Type-II is the failure to reject a false null hypothesis; we failed to detect a real effect

Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true

**Two main things affect statistical power:**

- Sample size (more is better)
- Effect size (more is better)
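Both levers can be seen in a small simulation: generate experiments with a known true effect, run a standard two-sample z-test on each, and count how often the null is rejected. The rates, sample sizes, and test choice below are illustrative assumptions:

```python
import random

random.seed(0)

def reject_null(p_a, p_b, n, z=1.96):
    """One two-sample proportion z-test at alpha = 0.05 (two-sided)."""
    a = sum(random.random() < p_a for _ in range(n)) / n
    b = sum(random.random() < p_b for _ in range(n)) / n
    pooled = (a + b) / 2
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return se > 0 and abs(b - a) / se > z

def power(p_a, p_b, n, reps=1000):
    """Fraction of simulated experiments in which the true effect is detected."""
    return sum(reject_null(p_a, p_b, n) for _ in range(reps)) / reps

low_n  = power(0.10, 0.13, 500)    # small sample, small effect
high_n = power(0.10, 0.13, 2000)   # larger sample, same effect
big_fx = power(0.10, 0.16, 500)    # same sample, larger effect
# Both high_n and big_fx come out well above low_n.
```

The same logic run in reverse is a power calculation: fix the smallest effect worth detecting and the desired power, then solve for the sample size before starting the experiment.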

The methods described assume strict adherence to protocol; violations of protocol such as peeking and multiple testing increase the type-I error rate

Fair die; I still cheated.

(If you keep trying, the probability of eventually rolling three sixes in a row with a fair die approaches 1.)
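The same inflation happens with peeking: testing the data after every batch and stopping at the first significant result raises the false-positive rate far above the nominal 5%, even when the null is true. A sketch of that effect, with illustrative batch sizes and peek counts:

```python
import random

random.seed(1)

def peeking_false_positive(p=0.5, batch=200, looks=10, z=1.96):
    """Null is true (both groups share rate p); peek after every batch
    and stop as soon as a z-test crosses the nominal 5% threshold."""
    yes_a = yes_b = n = 0
    for _ in range(looks):
        yes_a += sum(random.random() < p for _ in range(batch))
        yes_b += sum(random.random() < p for _ in range(batch))
        n += batch
        a, b = yes_a / n, yes_b / n
        pooled = (a + b) / 2
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and abs(b - a) / se > z:
            return True  # "significant" despite there being no effect
    return False

rate = sum(peeking_false_positive() for _ in range(1000)) / 1000
# With ten peeks, the realized type-I error rate lands well above 0.05.
```

Sequential testing procedures (as in [4]) recover flexibility by adjusting the thresholds at each look instead of reusing the fixed single-look threshold.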

More flexible protocols may be desirable

- early stopping rules to mitigate damage
- early shipping to minimize opportunity cost
- multiple variants to test several alternatives
- multiple metrics to guard business KPIs

All of these are possible [4], but they require protocol adjustments

[4] Deng, Alex, Tianxi Li, and Yu Guo. 2014. "Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation." WWW '14: 609–618.

- Rubin, Donald B. 1974. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology 66 (5): 688–701. (link)
- Goodman, Steven N. 2008. "A Dirty Dozen: Twelve P-Value Misconceptions." Seminars in Hematology 45: 135–140. (link)
- Kohavi, Ron, Roger Longbotham, Dan Sommerfield, et al. 2009. "Controlled Experiments on the Web: Survey and Practical Guide." Data Mining and Knowledge Discovery 18: 140. (link)
- Deng, Alex, Tianxi Li, and Yu Guo. 2014. "Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation." WWW '14: 609–618. (link)