# Introduction to Sequential Monte Carlo Methods

These notes are based on the following article:

Doucet, Arnaud, Nando de Freitas, and Neil Gordon (2001): “An Introduction to Sequential Monte Carlo Methods,” in Sequential Monte Carlo Methods in Practice, ed. by A. Doucet, N. de Freitas, and N. Gordon. New York: Springer-Verlag.

## Introduction

Real-world data analysis often requires estimating unknown quantities given only a sequence of observations on some related observable quantity. In a Bayesian framework, one typically has a priori knowledge of the model: a prior distribution of the unobservable quantity of interest and likelihood functions which relate the observables to the unobservables. The resulting posterior distribution of the unobservables can be calculated using Bayes’ theorem. This allows one to conduct inference about the unobserved quantities.

In some cases, it is natural to process observations sequentially; these cases are the focus of this introductory article and the rest of the book. In on-line applications such as radar tracking or monitoring financial instruments, where new data become available in real time, it is easier to update the previously formed posterior distribution than to recalculate it from scratch.

Sequential Monte Carlo methods are simulation-based methods for calculating approximations to posterior distributions. They avoid making linearity or normality assumptions required by related methods such as the Kalman filter.

## Model

Let $\{x_t\}_{t=0}^{\infty}$ denote the sequence of unobserved states, with $x_t \in \mathcal{X}$. The authors consider only the case in which $\{x_t\}$ is a Markov process with initial distribution $p(x_0)$; the Markov assumption is not essential and is made only for tractability. Let $p(x_t \mid x_{t-1})$ denote the transition equation. Let $\{y_t\}_{t=1}^{\infty}$ denote the sequence of observations, with $y_t \in \mathcal{Y}$. Suppose that, given $\{x_t\}$, the observations are conditionally independent, and let $p(y_t \mid x_t)$ denote their marginal distribution, or the observation equation.

Thus, the model is defined by the following distributions:

$$p(x_0), \qquad p(x_t \mid x_{t-1}) \quad \text{for } t \ge 1, \qquad p(y_t \mid x_t) \quad \text{for } t \ge 1.$$
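To make the setup concrete, the following sketch simulates one path from a hypothetical linear-Gaussian instance of this model; the AR coefficient and noise scales are illustrative assumptions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian state-space model (parameters are assumptions):
#   x_0 ~ N(0, 1),  x_t = 0.9 x_{t-1} + v_t,  y_t = x_t + w_t
def simulate(T):
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    y[0] = np.nan                # observations start at t = 1
    x[0] = rng.normal()          # draw from the prior p(x_0)
    for t in range(1, T + 1):
        x[t] = 0.9 * x[t - 1] + rng.normal(scale=0.5)  # transition p(x_t | x_{t-1})
        y[t] = x[t] + rng.normal(scale=0.3)            # observation p(y_t | x_t)
    return x, y

x, y = simulate(50)
```

Given $\{x_t\}$, each $y_t$ here depends only on $x_t$, so the conditional independence assumption holds by construction.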

Define $x_{0:t} \equiv \{x_0, \dots, x_t\}$ and $y_{1:t} \equiv \{y_1, \dots, y_t\}$. Our goal is to estimate the posterior distribution $p(x_{0:t} \mid y_{1:t})$ recursively. We may also care about the marginal distribution $p(x_t \mid y_{1:t})$ or an expectation such as

$$I(h_t) \equiv E\left[h_t(x_{0:t}) \mid y_{1:t}\right] = \int h_t(x_{0:t})\, p(x_{0:t} \mid y_{1:t})\, dx_{0:t}$$

for some $h_t : \mathcal{X}^{t+1} \to \mathbb{R}^n$ that is integrable with respect to $p(x_{0:t} \mid y_{1:t})$.

Bayes’ theorem gives us the posterior distribution at any time $t$:

$$p(x_{0:t} \mid y_{1:t}) = \frac{p(y_{1:t} \mid x_{0:t})\, p(x_{0:t})}{\int p(y_{1:t} \mid x_{0:t})\, p(x_{0:t})\, dx_{0:t}}.$$

There is an inherent recursive relationship for updating the posterior distribution. We can express $p(x_{0:t+1} \mid y_{1:t+1})$ in terms of $p(x_{0:t} \mid y_{1:t})$:

$$\begin{aligned}
p(x_{0:t+1} \mid y_{1:t+1}) &= \frac{p(x_{0:t+1}, y_{1:t+1})}{p(y_{1:t+1})} \\
&= \frac{p(x_{t+1}, y_{t+1} \mid x_{0:t}, y_{1:t})\, p(x_{0:t}, y_{1:t})}{p(y_{t+1} \mid y_{1:t})\, p(y_{1:t})} \\
&= \frac{p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid x_t)}{p(y_{t+1} \mid y_{1:t})}\, p(x_{0:t} \mid y_{1:t}),
\end{aligned}$$

where the last equality uses the definition of the conditional density, the conditional independence assumption, and the Markov assumption. The expression in the denominator is constant with respect to $x_{0:t+1}$.

## Perfect Monte Carlo Sampling

In the ideal situation, we could draw a sample of size $N$ from $p(x_{0:t} \mid y_{1:t})$, the posterior distribution of interest. Call this sample $\{x_{0:t}^{(i)}\}_{i=1}^{N}$. We can form an estimate of $p(x_{0:t} \mid y_{1:t})$ using the empirical distribution of the sample:

$$P_N(dx_{0:t} \mid y_{1:t}) = N^{-1} \sum_{i=1}^{N} \delta_{x_{0:t}^{(i)}}(dx_{0:t}).$$

From here, we can estimate the integral $I(h_t)$ by

$$\int h_t(x_{0:t})\, P_N(dx_{0:t} \mid y_{1:t}) = N^{-1} \sum_{i=1}^{N} h_t\left(x_{0:t}^{(i)}\right).$$
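As a sanity check of the perfect-sampling estimator, the sketch below stands in a known normal distribution for the posterior (an assumption made purely so that direct sampling is possible) and estimates $I(h)$ for $h(x) = x$ by the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in posterior: pretend p(x_{0:t} | y_{1:t}) is N(2, 1) so that
# direct sampling is possible (purely illustrative).
N = 100_000
samples = rng.normal(loc=2.0, scale=1.0, size=N)

# Perfect Monte Carlo estimate of I(h) = E[h(x) | y] with h(x) = x:
# integrating h against the empirical distribution is just the sample mean.
I_hat = samples.mean()
```

By the law of large numbers the estimate converges to the true mean (here 2) at rate $N^{-1/2}$, regardless of the dimension of the state.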

Typically it is impossible to get such a sample since $p\left({x}_{0:t}|{y}_{1:t}\right)$ is multivariate, known only up to a constant of proportionality, and non-standard. Markov chain Monte Carlo methods may be used in similar situations, but they are not well-suited to recursive problems such as the ones considered here.

## Importance Sampling

Since we cannot draw from $p(x_{0:t} \mid y_{1:t})$ directly, we can use importance sampling to draw from $p(x_{0:t} \mid y_{1:t})$ indirectly, using an importance sampling density $\pi(x_{0:t} \mid y_{1:t})$ and a corresponding importance weight for each draw. The importance sampling density must be chosen so that its support includes the support of $p(x_{0:t} \mid y_{1:t})$. We have

$$\begin{aligned}
I(h_t) \equiv E\left[h_t(x_{0:t}) \mid y_{1:t}\right] &= \int h_t(x_{0:t})\, p(x_{0:t} \mid y_{1:t})\, dx_{0:t} \\
&= \frac{\int h_t(x_{0:t})\, p(x_{0:t} \mid y_{1:t})\, dx_{0:t}}{\int p(x_{0:t} \mid y_{1:t})\, dx_{0:t}} \\
&= \frac{\int h_t(x_{0:t})\, w(x_{0:t})\, \pi(x_{0:t} \mid y_{1:t})\, dx_{0:t}}{\int w(x_{0:t})\, \pi(x_{0:t} \mid y_{1:t})\, dx_{0:t}},
\end{aligned}$$

where $w(x_{0:t}) \equiv \frac{p(x_{0:t} \mid y_{1:t})}{\pi(x_{0:t} \mid y_{1:t})}$ is the importance weight. Because the weight appears in both the numerator and the denominator, it need only be known up to a constant of proportionality.

Thus, to approximate the expectation $I(h_t)$, we simply need to draw a sample $\{x_{0:t}^{(i)}\}_{i=1}^{N}$ from $\pi(x_{0:t} \mid y_{1:t})$ and calculate

$$\hat{I}(h_t) = \frac{N^{-1} \sum_i h_t\left(x_{0:t}^{(i)}\right) w\left(x_{0:t}^{(i)}\right)}{N^{-1} \sum_i w\left(x_{0:t}^{(i)}\right)} = \sum_i h_t\left(x_{0:t}^{(i)}\right) \tilde{w}_t^{(i)},$$

where $\tilde{w}_t^{(i)} \equiv \frac{w\left(x_{0:t}^{(i)}\right)}{\sum_j w\left(x_{0:t}^{(j)}\right)}$ are the normalized importance weights.

This procedure can be interpreted as a sampling method with an approximate posterior given by

$$\hat{P}_N(dx_{0:t} \mid y_{1:t}) = \sum_i \tilde{w}_t^{(i)}\, \delta_{x_{0:t}^{(i)}}(dx_{0:t}).$$

We can then approximate $I(h_t)$ by

$$\hat{I}(h_t) = \int h_t(x_{0:t})\, \hat{P}_N(dx_{0:t} \mid y_{1:t}).$$
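The self-normalized estimator above can be sketched as follows; the target density (a normal, known only up to its normalizing constant) and the proposal are both illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Unnormalized target: proportional to a N(1, 0.5^2) density.
def target_unnorm(x):
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)

N = 200_000
x = rng.normal(size=N)                             # draws from the proposal pi = N(0, 1)
pi_density = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

w = target_unnorm(x) / pi_density                  # importance weights w(x) = p(x) / pi(x)
w_tilde = w / w.sum()                              # normalized weights

I_hat = np.sum(x * w_tilde)                        # estimate of E[x] under the target
```

Because the weights are normalized, the unknown constant of the target cancels; the estimate should be close to the target mean of 1.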

Although this solves the problem of being able to obtain draws, this approach is still not well suited to our recursive problem: each time a new observation arrives, the weights must be recomputed over the entire trajectory.

## Sequential Importance Sampling (SIS)

The goal of Sequential Importance Sampling (SIS) is to compute an estimate of $p(x_{0:t} \mid y_{1:t})$ using the past simulated values $\{x_{0:t-1}^{(i)}\}_{i=1}^{N}$. We have the following recursive representation of the importance sampling distribution:

$$\pi(x_{0:t} \mid y_{1:t}) = \pi(x_0) \prod_{k=1}^{t} \pi(x_k \mid x_{0:k-1}, y_{1:k}).$$

This recursive relationship allows us to calculate the importance weights recursively:

$$\begin{aligned}
w(x_{0:t}) &\equiv \frac{p(x_{0:t} \mid y_{1:t})}{\pi(x_{0:t} \mid y_{1:t})} \\
&= \frac{p(x_{0:t}, y_{1:t})}{p(y_{1:t})\, \pi(x_{0:t-1} \mid y_{1:t-1})\, \pi(x_t \mid x_{0:t-1}, y_{1:t})} \\
&= \frac{p(x_{0:t-1}, y_{1:t-1})\, p(x_t, y_t \mid x_{0:t-1}, y_{1:t-1})}{p(y_t \mid y_{1:t-1})\, p(y_{1:t-1})\, \pi(x_{0:t-1} \mid y_{1:t-1})\, \pi(x_t \mid x_{0:t-1}, y_{1:t})} \\
&= \frac{p(x_{0:t-1} \mid y_{1:t-1})}{\pi(x_{0:t-1} \mid y_{1:t-1})} \cdot \frac{p(y_t \mid x_t)\, p(x_t \mid x_{0:t-1})}{p(y_t \mid y_{1:t-1})\, \pi(x_t \mid x_{0:t-1}, y_{1:t})} \\
&= w(x_{0:t-1})\, \frac{p(y_t \mid x_t)\, p(x_t \mid x_{0:t-1})}{p(y_t \mid y_{1:t-1})\, \pi(x_t \mid x_{0:t-1}, y_{1:t})}.
\end{aligned}$$

Hence,

$$\tilde{w}\left(x_{0:t}^{(i)}\right) = \frac{w\left(x_{0:t}^{(i)}\right)}{\sum_j w\left(x_{0:t}^{(j)}\right)}
= \frac{w\left(x_{0:t-1}^{(i)}\right) \dfrac{p\left(y_t \mid x_t^{(i)}\right) p\left(x_t^{(i)} \mid x_{0:t-1}^{(i)}\right)}{p(y_t \mid y_{1:t-1})\, \pi\left(x_t^{(i)} \mid x_{0:t-1}^{(i)}, y_{1:t}\right)}}{\sum_j w\left(x_{0:t-1}^{(j)}\right) \dfrac{p\left(y_t \mid x_t^{(j)}\right) p\left(x_t^{(j)} \mid x_{0:t-1}^{(j)}\right)}{p(y_t \mid y_{1:t-1})\, \pi\left(x_t^{(j)} \mid x_{0:t-1}^{(j)}, y_{1:t}\right)}}.$$

Notice that the $p(y_t \mid y_{1:t-1})$ term is constant across all particles and cancels out.

A special case occurs when the prior is used as the importance density, $\pi(x_{0:t} \mid y_{1:t}) = p(x_{0:t}) = p(x_0) \prod_{k=1}^{t} p(x_k \mid x_{k-1})$, so that $w(x_{0:t}) \propto w(x_{0:t-1})\, p(y_t \mid x_t)$ and $\tilde{w}_t^{(i)} \propto \tilde{w}_{t-1}^{(i)}\, p\left(y_t \mid x_t^{(i)}\right)$.

SIS is usually inefficient for high-dimensional integrals because, as $t \to \infty$, the importance weights of all but a few particles quickly approach zero, so most of the computational effort is wasted on particles that contribute almost nothing to the estimate.
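The special-case recursion $\tilde{w}_t^{(i)} \propto \tilde{w}_{t-1}^{(i)}\, p(y_t \mid x_t^{(i)})$, and the degeneracy it produces, can be sketched on the same hypothetical linear-Gaussian model used earlier (all parameters are assumptions). The effective sample size $1 / \sum_i (\tilde{w}_t^{(i)})^2$ tracked below measures how many particles still carry appreciable weight:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical model: x_t = 0.9 x_{t-1} + N(0, 0.5^2), y_t = x_t + N(0, 0.3^2)
def likelihood(y_t, x_t, sigma=0.3):
    return np.exp(-0.5 * ((y_t - x_t) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N, T = 500, 20

# Simulate a data path from the model itself
x_true = np.empty(T + 1)
y = np.empty(T + 1)
x_true[0] = rng.normal()
for t in range(1, T + 1):
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal(scale=0.5)
    y[t] = x_true[t] + rng.normal(scale=0.3)

# SIS with the prior as importance density: propagate, reweight, never resample
particles = rng.normal(size=N)            # draws from p(x_0)
w = np.full(N, 1.0 / N)
ess_path = []
for t in range(1, T + 1):
    particles = 0.9 * particles + rng.normal(scale=0.5, size=N)  # x_t ~ p(x_t | x_{t-1})
    w = w * likelihood(y[t], particles)   # w_t ∝ w_{t-1} p(y_t | x_t)
    w = w / w.sum()                       # normalize
    ess_path.append(1.0 / np.sum(w ** 2))
```

With no resampling, `ess_path` typically collapses to a small fraction of $N$ after only a few steps, which is the degeneracy described above.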

## The Bootstrap Filter

To prevent particle degeneracy, the bootstrap filter introduces a resampling step which eliminates particles with low importance weights. Instead of using the importance-weighted empirical distribution $\hat{P}_N$, we use a uniformly weighted distribution

$$P_N(dx_{0:t} \mid y_{1:t}) = N^{-1} \sum_i N_t^{(i)}\, \delta_{x_{0:t}^{(i)}}(dx_{0:t}),$$

where $N_t^{(i)}$ is the number of offspring of the particle $x_{0:t}^{(i)}$, which is to be determined by some branching mechanism. The most common such mechanism involves resampling $N$ times from $\hat{P}_N$ (Gordon et al., 1993). We require $\sum_i N_t^{(i)} = N$ for all $t$. If $N_t^{(j)} = 0$, the particle $x_{0:t}^{(j)}$ dies. We want to choose $N_t^{(i)}$ such that

$$\int h_t(x_{0:t})\, P_N(dx_{0:t} \mid y_{1:t}) \approx \int h_t(x_{0:t})\, \hat{P}_N(dx_{0:t} \mid y_{1:t}).$$

The surviving particles, those with $N_t^{(i)} > 0$, are approximately distributed according to $p(x_{0:t} \mid y_{1:t})$.

### Algorithm

1. Initialization ($t=0$).

• For $i=1,\dots,N$, draw $x_0^{(i)} \sim p(x_0)$.
• Set $t \leftarrow 1$.
2. Importance Sampling Step

• For $i=1,\dots,N$, draw $\tilde{x}_t^{(i)} \sim p(x_t \mid x_{t-1}^{(i)})$ and set $\tilde{x}_{0:t}^{(i)} \leftarrow \left(x_{0:t-1}^{(i)}, \tilde{x}_t^{(i)}\right)$.

• For $i=1,\dots,N$, calculate the importance weights $\tilde{w}_t^{(i)} = p\left(y_t \mid \tilde{x}_t^{(i)}\right)$ and normalize them.

3. Resampling Step

• Take $N$ draws $\{x_{0:t}^{(i)}\}_{i=1}^{N}$ with replacement from the set $\{\tilde{x}_{0:t}^{(i)}\}_{i=1}^{N}$ with weights $\{\tilde{w}_t^{(i)}\}_{i=1}^{N}$.

• Set $t \leftarrow t+1$ and go to step 2.
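Putting the three steps together, here is a minimal bootstrap filter sketched on the same hypothetical linear-Gaussian model used in the earlier examples (all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def bootstrap_filter(y, N=1000, a=0.9, sig_v=0.5, sig_w=0.3):
    """Bootstrap filter for x_t = a x_{t-1} + N(0, sig_v^2), y_t = x_t + N(0, sig_w^2).
    Returns the filtered means, i.e. estimates of E[x_t | y_{1:t}]."""
    T = len(y) - 1
    particles = rng.normal(size=N)            # 1. initialization: x_0^(i) ~ p(x_0)
    means = np.empty(T + 1)
    means[0] = particles.mean()
    for t in range(1, T + 1):
        # 2. importance sampling step: propagate through p(x_t | x_{t-1})
        proposed = a * particles + rng.normal(scale=sig_v, size=N)
        w = np.exp(-0.5 * ((y[t] - proposed) / sig_w) ** 2)  # w_t^(i) = p(y_t | x_t^(i))
        w = w / w.sum()                                      # normalize
        # 3. resampling step: N draws with replacement, weights w
        idx = rng.choice(N, size=N, p=w)
        particles = proposed[idx]
        means[t] = particles.mean()
    return means

# Simulate data from the model, then run the filter on it
T = 30
x_true = np.empty(T + 1)
y = np.empty(T + 1)
y[0] = np.nan
x_true[0] = rng.normal()
for t in range(1, T + 1):
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal(scale=0.5)
    y[t] = x_true[t] + rng.normal(scale=0.3)

means = bootstrap_filter(y)
```

Because resampling discards low-weight particles at every time step, the filtered means should track the simulated state closely even over long horizons, in contrast to plain SIS.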