Causal Bandits: Learning Good Interventions via
Causal Inference
Finnian Lattimore
Australian National University and Data61/NICTA
XXXXXXXXXX
Tor Lattimore
University of Alberta
XXXXXXXXXX
Mark D. Reid
Australian National University and Data61/NICTA
XXXXXXXXXX
Abstract
We study the problem of using causal models to improve the rate at which good
interventions can be learned online in a stochastic environment. Our formalism
combines multi-arm bandits and causal inference to model a novel type of bandit
feedback that is not exploited by existing approaches. We propose a new algorithm
that exploits the causal feedback and prove a bound on its simple regret that is
strictly better (in all quantities) than algorithms that do not use the additional causal
information.
1 Introduction
Medical drug testing, policy setting, and other scientific processes are commonly framed and analysed
in the language of sequential experimental design and, in special cases, as bandit problems (Robbins,
1952; Chernoff, XXXXXXXXXX). In this framework, single actions (also referred to as interventions) from a
pre-determined set are repeatedly performed in order to evaluate their effectiveness via feedback from
a single, real-valued reward signal. We propose a generalisation of the standard model by assuming
that, in addition to the reward signal, the learner observes the values of a number of covariates drawn
from a probabilistic causal model (Pearl, XXXXXXXXXX). Causal models are commonly used in disciplines
where explicit experimentation may be difficult, such as social science, demography, and economics:
for example, when predicting the effect of changes to childcare subsidies on workforce participation,
or of school choice on grades. Results from causal inference relate observational distributions to
interventional ones, allowing the outcome of an intervention to be predicted without explicitly
performing it. By exploiting the causal information we show, theoretically and empirically, how
non-interventional observations can be used to improve the rate at which high-reward actions can be
identified.
The type of problem we are concerned with is best illustrated with an example. Consider a farmer
wishing to optimise the yield of her crop. She knows that crop yield is only affected by temperature,
a particular soil nutrient, and moisture level but the precise effect of their combination is unknown.
In each season the farmer has enough time and money to intervene and control at most one of these
variables: deploying shade or heat lamps will set the temperature to be low or high; the nutrient
can be added or removed through a choice of fertilizer; and irrigation or rain-proof covers will keep
the soil wet or dry. When not intervened upon, the temperature, soil, and moisture vary naturally
from season to season due to weather conditions and these are all observed along with the final crop
yield at the end of each season. How might the farmer best experiment to identify the single, highest
yielding intervention in a limited number of seasons?
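To make the structure of this example concrete, the following sketch simulates one season. It is not taken from the paper: the observational probabilities and the yield model are invented purely for illustration. Each factor is drawn from its natural distribution unless the farmer intervenes on it, and all three factors are observed together with the yield.

import numpy as np

rng = np.random.default_rng(0)

# Natural (observational) distributions of the three binary factors; these
# probabilities and the yield model below are invented for illustration.
NATURAL_PROB = {"temperature": 0.6, "nutrient": 0.4, "moisture": 0.7}

def season(intervention=None):
    """Simulate one season of the farming example.

    intervention: None for the empty intervention do(), or a (variable, value)
    pair such as ("moisture", 1) for deploying irrigation.
    Returns the observed values of all three factors and the crop yield.
    """
    factors = {v: int(rng.random() < p) for v, p in NATURAL_PROB.items()}
    if intervention is not None:
        variable, value = intervention
        factors[variable] = value  # do(variable = value): override the natural value
    # Invented yield model: nutrient and moisture help, high temperature hurts.
    mean_yield = (1.0 + 0.5 * factors["nutrient"] + 0.8 * factors["moisture"]
                  - 0.3 * factors["temperature"])
    return factors, mean_yield + rng.normal(scale=0.1)

# One season with irrigation: moisture is forced to 1, the rest vary naturally.
observed_factors, crop_yield = season(("moisture", 1))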
Contributions We take the first step towards formalising and solving problems such as the one
above. In §2 we formally introduce causal bandit problems in which interventions are treated as
arms in a bandit problem but their influence on the reward — along with any other observations
— is assumed to conform to a known causal graph. We show that our causal bandit framework
subsumes the classical bandits (no additional observations) and contextual stochastic bandit problems
(observations are revealed before an intervention is chosen) before focusing on the case where, like
the above example, observations occur after each intervention is made.
Our focus is on the simple regret, which measures the difference between the return of the optimal
action and that of the action chosen by the algorithm after T rounds. In §3 we analyse a specific
family of causal bandit problems that we call parallel bandit problems, in which N factors affect the
reward independently and there are 2N possible interventions. We propose a simple causal best arm
identification algorithm for this problem and show that, up to logarithmic factors, it enjoys minimax
optimal simple regret guarantees of Θ̃(√(m/T)), where m depends on the causal model and may
be much smaller than N. In contrast, existing best arm identification algorithms suffer Ω(√(N/T))
simple regret (Thm. 4 by Audibert and Bubeck, XXXXXXXXXX). This shows theoretically the value of our
framework over the traditional bandit problem. Experiments in §5 further demonstrate the value of
causal models in this framework.
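The gain in the parallel case comes from the fact that, when the factors have no parents and no confounders, samples from the empty intervention do() also inform every arm do(Xi = j): conditioning on Xi = j in observational data matches the interventional distribution. The sketch below is illustrative only (the variable count, probabilities, and reward model are invented); it estimates all 2N arm values from a single observational batch, and the arms whose values occur rarely under do() are exactly the ones that stay poorly estimated, which is roughly the difficulty the quantity m captures.

import numpy as np

rng = np.random.default_rng(1)
N, T = 5, 1000

# Invented parallel model: each X_i ~ Bernoulli(q_i) independently, and the
# binary reward Y depends (say) only on X_1.
q = rng.uniform(0.05, 0.95, size=N)
theta = np.array([0.3, 0.7])  # P(Y = 1 | X_1 = 0) and P(Y = 1 | X_1 = 1)

# T purely observational rounds, i.e. rounds where the empty intervention do() is played.
X = (rng.random((T, N)) < q).astype(int)
Y = (rng.random(T) < theta[X[:, 0]]).astype(int)

# In the parallel graph there is no confounding, so E[Y | do(X_i = j)] = E[Y | X_i = j],
# and every one of the 2N arms can be estimated from the same observational batch.
estimates = {}
for i in range(N):
    for j in (0, 1):
        mask = X[:, i] == j
        estimates[(i, j)] = Y[mask].mean() if mask.any() else None  # rare arms stay unestimated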
In the general causal bandit problem, interventions and observations may have a complex relationship.
In §4 we propose a new algorithm inspired by importance-sampling that a) enjoys sub-linear regret
equivalent to the optimal rate in the parallel bandit setting and b) captures many of the intricacies of
sharing information in a causal graph in the general case. As in the parallel bandit case, the regret
guarantee scales like O(√(m/T)), where m depends on the underlying causal structure, with smaller
values corresponding to structures that are easier to learn. The value of m is always less than the
number of interventions N, and in the special case of the parallel bandit (where we have lower bounds)
the notions are equivalent.
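As a rough illustration of the importance-sampling idea behind the general algorithm, the sketch below gives a generic truncated importance-weighted average. The function name and interface are invented, and the paper's actual algorithm chooses the sampling distribution and truncation level far more carefully; the sketch only shows the basic mechanism of reweighting samples gathered under one distribution to estimate the mean reward under another, with the weights clipped to control variance at the price of some bias.

import numpy as np

def truncated_iw_estimate(rewards, logp_target, logp_behaviour, clip_bound):
    """Truncated importance-weighted estimate of the mean reward under a target
    intervention, from samples gathered while acting under a different
    (behaviour) sampling distribution.

    rewards:        observed rewards Y_t
    logp_target:    log-probability of each sample under the target intervention
    logp_behaviour: log-probability of each sample under the behaviour distribution
    clip_bound:     weights above this level are truncated, trading variance for bias
    """
    weights = np.exp(np.asarray(logp_target) - np.asarray(logp_behaviour))
    weights = np.minimum(weights, clip_bound)
    return float(np.mean(weights * np.asarray(rewards)))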
Related Work As alluded to above, causal bandit problems can be treated as classical multi-armed
bandit problems by simply ignoring the causal model and extra observations and applying an existing
best-arm identification algorithm with well-understood simple regret guarantees (Jamieson et al.,
2014). However, as we show in §3, ignoring the extra information available in the non-intervened
variables yields sub-optimal performance.
A well-studied class of bandit problems with side information is “contextual bandits” (Langford
and Zhang, 2008; Agarwal et al., XXXXXXXXXX). Our framework bears a superficial similarity to contextual
bandit problems since the extra observations on non-intervened variables might be viewed as context
for selecting an intervention. However, a crucial difference is that in our model the extra observations
are only revealed after selecting an intervention and hence cannot be used as context.
There have been several proposals for bandit problems where extra feedback is received after an
action is taken. Most recently, Alon et al. (2015) and Kocák et al. (XXXXXXXXXX) have considered very general
models related to partial monitoring games (Bartók et al., 2014) where rewards on unplayed actions
are revealed according to a feedback graph. As we discuss in §6, the parallel bandit problem can be
captured in this framework; however, the regret bounds are not optimal in our setting. They also focus
on cumulative regret, which cannot be used to guarantee low simple regret (Bubeck et al., XXXXXXXXXX). The
partial monitoring approach taken by Wu et al. (XXXXXXXXXX) could be applied (up to modifications for the
simple regret) to the parallel bandit, but the resulting strategy would need to know the likelihood
of each factor in advance, while our strategy learns this online. Yu and Mannor (XXXXXXXXXX) utilize
extra observations to detect changes in the reward distribution, whereas we assume fixed reward
distributions and use extra observations to improve arm selection. Avner et al. (XXXXXXXXXX) analyse bandit
problems where the choice of arm to pull and arm to receive feedback on are decoupled. The main
difference from our present work is our focus on simple regret and the more complex information
linking rewards for different arms via causal graphs. To the best of our knowledge, our paper is the
first to analyse simple regret in bandit problems with extra post-action feedback.
Two pieces of recent work also consider applying ideas from causal inference to bandit problems.
Bareinboim et al. (XXXXXXXXXX) demonstrate that in the presence of confounding variables the value that a
variable would have taken had it not been intervened on can provide important contextual information.
Their work differs from ours in several ways: for example, the focus is on cumulative regret, and the
context is observed before the action is taken and cannot be controlled by the learning agent.
Ortega and Braun (XXXXXXXXXX) present an analysis and extension of Thompson sampling assuming actions
are causal interventions. Their focus is on causal induction (i.e., learning an unknown causal model)
instead of exploiting a known causal model. Combining their handling of causal induction with our
analysis is left as future work.
The truncated importance weighted estimators used in §4 have been studied before in a causal
framework by Bottou et al. (2013), where the focus is on learning from observational data, but
not controlling the sampling process. They also briefly discuss some of the issues encountered in
sequential design, but do not give an algorithm or theoretical results for this case.
2 Problem Setup
We now introduce a novel class of stochastic sequential decision problems which we call causal
bandit problems. In these problems, rewards are given for repeated interventions on a fixed causal
model (Pearl, XXXXXXXXXX). Following the terminology and notation in Koller and Friedman (2009), a causal
model is given by a directed acyclic graph G over a set of random variables X = {X1, . . . , XN}
and a joint distribution P over X that factorises over G. We will assume each variable only takes
on a finite number of distinct values. An edge from variable Xi to Xj is interpreted to mean that a
change in the value of Xi may directly cause a change to the value of Xj . The parents of a variable
Xi, denoted PaXi, is the set of all variables Xj such that there is an edge from Xj to Xi in G. An
intervention or action (of size n), denoted do(X = x), assigns the values x = {x1, . . . , xn} to the
corresponding variables X = {X1, . . . , Xn} ⊂ X with the empty intervention (where no variable is
set) denoted do(). The intervention also “mutilates” the graph G by removing all edges from Pai
to Xi for each Xi ∈X . The resulting graph defines a probability distribution P {Xc|do(X = x)}
over Xc := X −X . Details can be found in Chapter 21 of Koller and Friedman (2009).
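As a minimal illustration of this mutilation step, the toy sketch below (a two-variable model with invented probabilities, not anything from the paper) implements do(X = x) by fixing the intervened variable and ignoring its parents, while the remaining variables are sampled from their usual conditional distributions.

import numpy as np

rng = np.random.default_rng(2)

def sample(do=None):
    """Sample a toy two-variable model X1 -> X2 under an intervention.

    do: dict mapping variable names to forced values, e.g. {"X1": 1};
        None or an empty dict corresponds to the empty intervention do().
    An intervened variable ignores its parents, i.e. the edges into it are cut,
    which is the graph mutilation described above.
    """
    do = do or {}
    x1 = do["X1"] if "X1" in do else int(rng.random() < 0.5)
    # X2's conditional distribution given its parent X1 (invented numbers).
    x2 = do["X2"] if "X2" in do else int(rng.random() < 0.2 + 0.6 * x1)
    return {"X1": x1, "X2": x2}

# Samples from P{X2 | do(X1 = 1)}: X1 is forced and X2 still responds to it, whereas
# do(X2 = 1) would leave the marginal distribution of X1 untouched.
samples = [sample({"X1": 1}) for _ in range(1000)]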
A learner for a causal bandit problem is given the causal model’s graph G and a set of allowed actions
A. One variable Y ∈ X is designated as the reward variable and takes on values in {0, 1}. We denote
the expected reward for the action a = do(X = x) by µa := E [Y |do(X = x)] and the optimal
expected reward by µ∗ := maxa∈A µa. The causal bandit game proceeds over T rounds. In round t,
the learner intervenes by choosing at = do(Xt = xt) ∈ A based on previous observations. It then
observes sampled values for all non-intervened variables Xct drawn from P {Xct | do(Xt = xt)},
including the reward Yt ∈ {0, 1}. After T observations the learner outputs an estimate of the optimal
action â∗T ∈ A based on its prior observations.
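This interaction protocol can be summarised as a short loop. The sketch below is schematic: the learner and environment interfaces are invented placeholders rather than anything specified in the paper, and only the order of events matters — the learner commits to an action, then observes all non-intervened variables together with the reward, and only after T rounds nominates a single action.

def run_causal_bandit(learner, environment, T):
    """Schematic causal bandit protocol.

    learner:     assumed interface with choose(t, history) returning an action in A
                 and recommend(history) returning the final guess at the best action
    environment: callable mapping an action do(X = x) to the sampled values of all
                 non-intervened variables, including the reward Y in {0, 1}
    T:           number of rounds before a single action must be recommended
    """
    history = []
    for t in range(T):
        action = learner.choose(t, history)   # pick a_t = do(X_t = x_t) from A
        observation = environment(action)     # observe X^c_t and the reward Y_t
        history.append((action, observation))
    return learner.recommend(history)         # estimate of the optimal action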
The objective of the