That we find out the cause of this effect,

Or rather say, the cause of this defect,

For this effect defective comes by cause.

-*Hamlet* by William Shakespeare

A few years back I stumbled upon the world of causality. Two things became immediately clear. First, it is a fascinating

]]>That we find out the cause of this effect,

Or rather say, the cause of this defect,

For this effect defective comes by cause.

-*Hamlet* by William Shakespeare

A few years back I stumbled upon the world of causality. Two things became immediately clear. First, it is a fascinating topic spanning some of the greatest scholars in philosophy, logic and statistics. Second, and most unfortunately, the literature on it tends to be a dense, uninviting tome.

The more I read, the more I realized that the central concepts are actually very simple. And beneath all the heated arguments about definitions of causality and frameworks for inferring it, the goals and principles of causal inference are one and the same. I wondered, and still wonder, why different disciplines choose to passionately accept one way of thinking about causality versus another. Especially as a computer scientist, I wondered why causal inference is not taught at all even in a graduate program, at a time when computers are affecting all walks of life.

As an outsider, I wished there was an easier way to describe the fundamentals of causal inference.

Last week I got an opportunity do just that. I gave a tutorial on causal inference at the International Conference on Computational Social Science (IC2S2). Here's my attempt at describing causal inference in an intuitive way, the way I have learnt it and the way I still understand it.

Hope this is useful. If it piques your interest, I also prepared sample code to play with recommender system data and methods for causal inference. Find it here: https://github.com/amit-sharma/causal-inference-tutorial

]]>

Two roads diverged in a wood, and I—

I took the one less traveled by,

But if I could go back, I will try

To take both: why one, then another.

When I first heard about doubly robust estimation, it sounded like magic, almost too good to be true. When

]]>Two roads diverged in a wood, and I—

I took the one less traveled by,

But if I could go back, I will try

To take both: why one, then another.

When I first heard about doubly robust estimation, it sounded like magic, almost too good to be true. When working with messy data, we are used to making tradeoffs. From statistics, there is the bias-variance tradeoff: you can't improve one without impacting the other. From computer science, there are time and space complexity tradeoffs: an algorithm can take up less time or less space, but devising one that reduces both is always hard.

Thus, in a world with no free lunches, the doubly robust promise stood out. Given two possibly faulty ways of estimating a quantity, a doubly robust estimator guarantees an **unbiased estimate**, whenever one of them is correct. This can be a boon when working with data with unknown distributions and sampling strategies.

How does it work? The math is non-intuitive, so it is best to start with an example.

Let us suppose we want to find the average height of people in a class. We could ask all the students to measure their height and then compute the mean of those observations.

$$\hat{H}_{mean}=\frac{\sum_{i=1}^n h_i}{n}$$

If all the students in the class were in attendance, the above formula will give us the correct answer. If not, and if some students were absent because of independent reasons, the mean estimate will still be an unbiased estimate of the average height in the class.

But what if some students were missing systematically? Say there was a basketball game and so many of the taller students were missing that day. Then the computed mean will be lower than the true average height of the class. One way to solve this problem is to know what kind of students are expected to be missing. Suppose we know that there are 5 students taller than 6" in the class, out of which 3 are missing. This means that any student taller than 6" has a 40% chance of being present in the class. To make matters simple, let us assume that all other students are present and therefore have a 100% chance of being in the class.

To account for absence of taller students, we could give more importance to observations from taller students.

$$\hat{H}_{IPS}=\frac{\sum_{i=1}^n h_i/p_i}{n}$$

where \(p_i\) is the probability that a student of height \(h_i\) is present in the class. In our example, \(p_i\) is 1 for everyone except tall students.

$$p_i = \begin{cases} 1 & \text{if } h_i<6 \\
0.4 & \text{if } h_i>=6

\end{cases}$$

Another way of writing the same formula is by summing over all students enrolled in teh class, and using an indicator variable \(o_i\) to denote whether the student was present.

$$\hat{H}_{IPS}=\frac{1}{n}\sum_{i=1}^N \frac{o_ih_i}{p_i}$$
Using this formula, we can expect to account for the missing tall students and achieve an unbiased estimate for the average height.

The problem though is that in many cases, we may not know which one of the above estimators to use. It could be that we do not have enough information about the missing students, and thus cannot accurately write down the probability of omission, \(p_i\) accurately. Or that the entire assumption about the missing basketball team is incorrect, and students across the height spectrum were absent at random.

Fortunately, the doubly robust estimator allows us to estimate the average height even when we are not sure about which assumption is true. It is given by:

$$\hat{H}_{DR}=\frac{1}{n} \sum_{i=1}^N [\frac{o_ih_i}{p_i} - \frac{o_i-p_i}{p_i}\hat{h}_{i,mean}]$$

Let's try to see why it works. First, let's rearrange some terms.

$$\hat{H}_{DR}=\frac{1}{n} \sum_{i=1}^N h_i + \frac{1}{n} \sum_{i=1}^N \frac{o_i-p_i}{p_i} (h_i-\hat{h}_{i,mean})$$

The DR estimator will be unbiased whenever the right term is zero.

- Let us suppose that the students are missing at random. Then, \(\hat{H}_{mean}=\sum_{i=1}^N h_i\), so the right term is zero.
- Similarly, when we suspect that taller students are absent more than other students and we can estimate \(p_i\) accurately, then \(\sum_{i=1}^N o_i-p_i = 0\) and again, we find that right term evaluates to zero.

*Voila!* In both cases, we obtain the correct average height using the DR estimator.

The same principle can be generalized to more complex scenarios. \(\hat{H}_{mean}\) can be generalized as a regression, based on some observed multi-dimensional data about students (\(X\)).

$$\hat{h}_{i,REG}=\alpha_0 + \sum_{j=1}^M \alpha_jx_j$$

Similarly, the calculation of propensity scores can be generalized, using observed data \(X\).

$$p_i = P(o_i=1|x_i) = Logit(\alpha_0 + \sum_{j=1}^M \alpha_jx_j)$$

Most generally, we require *some* way for estimating \(\hat{h}_{i,REG}\) and \(

p_i\). Any functional form, such as a learned decision tree or a machine learning model, can be substituted in place of the regression or logit models.

The catch, as you might imagine, is that in most practical cases, establishing even one of these assumptions is non-trivial. With messy data from the real-world, it is anybody's guess whether the data is missing at random, or what the correct probabilities of omission are.

Still, using a doubly robust estimator provides a useful check against modeling assumptions, as long as we do not err badly on both counts.

]]>For example, suppose you want to analyze the number of times people exercise in a month. One option would be to work with a data table consisting of the number of active days for each person.

```
library(dplyr)
library(ggplot2)
```

```
running_log <- data.frame(person_id=1:10000,
num_active_days=rpois(10000, lambda=15)) %>%
transmute(num_active_days=ifelse(num_active_days<=30, num_active_days, 30))
tbl_df(running_log)
```

```
## Source: local data frame [10,000 x 1]
##
## num_active_days
## 1 14
## 2 12
## 3 13
## 4 12
## 5 17
## 6 18
## 7 16
## 8 18
## 9 16
## 10 11
## .. ...
```

Using built-in function `ecdf`

, it is easy to generate the cumulative distribution function for such data.

```
cdf_fun <- ecdf(running_log$num_active_days)
plot(cdf_fun, xlab="Number of active days",
ylab="Number of people",main="")
```

For those interested, ggplot provides a function `stat_ecdf`

, which makes this process even simpler.

```
ggplot(running_log, aes(x=num_active_days)) + stat_ecdf()
```

Let us now compress our data by precomputing the frequency of people active on each day.

```
frequency_counts <- running_log %>% group_by(num_active_days) %>%
summarize(num_people=n())
frequency_counts
```

```
## Source: local data frame [28 x 2]
##
## num_active_days num_people
## 1 2 1
## 2 4 4
## 3 5 18
## 4 6 55
## 5 7 103
## 6 8 188
## 7 9 328
## 8 10 522
## 9 11 650
## 10 12 847
## .. ... ...
```

Note how this is a 30x2 table at worst, instead of 10,000x2 entries that we started with. Calculating the empirical cumulative distribution involves sorting the data and using `cumsum`

to calculate cumulative sums, then plot using the step function as before.

```
cumulative_frequencies <- frequency_counts %>%
arrange(num_active_days) %>%
mutate(cum_frequency=cumsum(num_people))
cumulative_frequencies
```

```
## Source: local data frame [28 x 3]
##
## num_active_days num_people cum_frequency
## 1 2 1 1
## 2 4 4 5
## 3 5 18 23
## 4 6 55 78
## 5 7 103 181
## 6 8 188 369
## 7 9 328 697
## 8 10 522 1219
## 9 11 650 1869
## 10 12 847 2716
## .. ... ... ...
```

```
ggplot(cumulative_frequencies, aes(x=num_active_days, y=cum_frequency)) +
geom_step() + xlab("Number of active days") + ylab("Number of people")
```

Now, suppose you wanted to break down your analysis by the type of activity. Let us assume that there are two types of activities: running and cycling.

```
cycling_log <- data.frame(person_id=1:10000,
num_active_days=rpois(10000, lambda=15)) %>%
transmute(num_active_days=ifelse(num_active_days<=30, num_active_days, 30))
activity_log <- rbind(running_log %>% mutate(activity_type="running"),
cycling_log %>% mutate(activity_type="cycling"))
frequency_counts = activity_log %>%
group_by(num_active_days, activity_type) %>%
summarize(num_people=n())
frequency_counts
```

```
## Source: local data frame [56 x 3]
## Groups: num_active_days
##
## num_active_days activity_type num_people
## 1 2 running 1
## 2 3 cycling 1
## 3 4 cycling 8
## 4 4 running 4
## 5 5 cycling 22
## 6 5 running 18
## 7 6 cycling 60
## 8 6 running 55
## 9 7 cycling 100
## 10 7 running 103
## .. ... ... ...
```

The same code can be used to generate the CDF, with an additional `group_by`

.

```
cumulative_frequencies <- frequency_counts %>% group_by(activity_type) %>%
mutate(cum_frequency=cumsum(num_people))
cumulative_frequencies
```

```
## Source: local data frame [56 x 4]
## Groups: activity_type
##
## num_active_days activity_type num_people cum_frequency
## 1 2 running 1 1
## 2 3 cycling 1 1
## 3 4 cycling 8 9
## 4 4 running 4 5
## 5 5 cycling 22 31
## 6 5 running 18 23
## 7 6 cycling 60 91
## 8 6 running 55 78
## 9 7 cycling 100 191
## 10 7 running 103 181
## .. ... ... ... ...
```

```
ggplot(cumulative_frequencies, aes(x=num_active_days, y=cum_frequency)) +
geom_step() + facet_wrap(~activity_type) +
xlab("Number of active days") + ylab("Number of people")
```

]]>