# The Metropolis-Hastings Algorithm

The **Metropolis-Hastings algorithm**, developed by Metropolis,
Rosenbluth, Rosenbluth, Teller, and Teller (1953) and generalized by
Hastings (1970), is a Markov chain Monte Carlo method which allows for
sampling from a distribution when traditional sampling methods such as
transformation or inversion fail. It requires only being able to
evaluate the density function. The normalizing constant need not be
known. This algorithm is very general and gives rise to the Gibbs
sampler as a special case.

## Markov Chain Monte Carlo Simulation

Consider a Markov transition kernel $P(x,A)$ for $x\in {\mathbb{R}}^{d}$ and $A\in \mathcal{B}$, where $\mathcal{B}$ is the Borel $\sigma $-algebra on ${\mathbb{R}}^{d}$. Two major concerns in Markov chain theory are whether there exists an invariant distribution andf whether iterating the transition kernel converges to the invariant distribution. The invariant distribution ${\pi}_{0}$ satisfies $${\pi}_{0}(\mathrm{dy})={\int}_{{\mathbb{R}}^{d}}P(x,\mathrm{dy})\pi (x)\phantom{\rule{thinmathspace}{0ex}}\mathrm{dx},$$ where $\pi $ is the density of ${\pi}_{0}$ with respect to Lebesgue measure. Let $${P}^{(n)}(x,A)={\int}_{{\mathbb{R}}^{d}}{P}^{(n-1)}(x,\mathrm{dy})P(y,A)$$ denote the $n$-th iteration of the transition kernel. We also want ${P}^{(n)}(x,A)$ to converge to ${\pi}_{0}$ as $n\to \mathrm{\infty}$. That is, the distribution of the draws generated by the iterations is approximately ${\pi}_{0}$.

Markov chain Monte Carlo methods look at the problem from the opposite
perspective. The invariant distribution is known: it is the target
distribution. The problem is how to generate an appropriate
transition kernel with the aforementioned convergence property.
Suppose that the transition kernel can be expressed as
$$P(x,\mathrm{dy})=p(x,y)\phantom{\rule{thinmathspace}{0ex}}\mathrm{dy}+r(x){\delta}_{x}(\mathrm{dy})$$
where $p(x,x)=0$, ${\delta}_{x}(\mathrm{dy})=1$ if $x\in \mathrm{dy}$ and $0$
otherwise, and define $r(x)=1-{\int}_{{\mathbb{R}}^{d}}p(x,y)\phantom{\rule{thinmathspace}{0ex}}\mathrm{dy}$.
The $r(x)$ term represents the probability of staying at $x$ which may
be nonzero. If this kernel satisfies the reversibility condition
$$label\mathrm{reversibility}\pi (x)p(x,y)=\pi (y)p(y,x),$$
then $\pi $ is the invariant distribution of $P(x,\cdot )$. The
**Metropolis-Hastings algorithm** generates a $p(x,y)$ with this
property.

## The Metropolis-Hastings Algorithm

Let $q(x,y)$ denote the *candidate-generating density*, where $\int q(x,y)\phantom{\rule{thinmathspace}{0ex}}\mathrm{dy}=1$. This is essentially the conditional density of $y$
given $x$. This density could potentially be used as the $p(x,y)$
term in the transition kernel, but it may not satisfy
(eq:reversibility). If, for example, we have
$$label\mathrm{inequality}\pi (x)q(x,y)>\pi (y)q(y,x),$$
then we can adjust $q$ by using a *probability of move* $\alpha (x,y)$.
Transitions will be made using
$${p}_{\mathrm{MH}}(x,y)\equiv q(x,y)\alpha (x,y)$$
for $x\ne y$.

The choice of $\alpha $ follows the following logic. If
(eq:inequality) holds, then moves from $x$ to $y$ are happening too
often under $q$. We should thus choose $\alpha (y,x)=1$. But then,
in order to satisfy (eq:reversibility), we must have
$$\begin{array}{rl}\pi (x)q(x,y)\alpha (x,y)& =\pi (y)q(y,x)\alpha (y,x)\\ & =\pi (y)q(y,x).\end{array}$$
This implies that
$$\alpha (x,y)=\frac{\pi (y)q(y,x)}{\pi (x)q(x,y)}.$$
Conversely, we can consider the case when the inequality in
(eq:reversibility) were reversed to derive $\alpha (y,x)$.

Thus, to summarize, in order to satisfy reversibility, we set
$$label\mathrm{alpha}\alpha (x,y)=\{\begin{array}{ll}\mathrm{min}[\frac{\pi (y)q(y,x)}{\pi (x)q(x,y)},1],& \text{if}\phantom{\rule{thickmathspace}{0ex}}\pi (x)q(x,y)>0\\ 1,& \text{otherwise.}\end{array}$$
Hence, the desired transition kernel is
$$label\mathrm{kernel}{P}_{\mathrm{MH}}(x,\mathrm{dy})=q(x,y)\alpha (x,y)\phantom{\rule{thinmathspace}{0ex}}\mathrm{dy}+[1-{\int}_{{\mathbb{R}}^{d}}q(x,y)\alpha (x,y)\phantom{\rule{thinmathspace}{0ex}}\mathrm{dy}]{\delta}_{x}(\mathrm{dy}).$$

Thus, the *Metropolis-Hastings algorithm* is defined by the
candidate-generating density $q(x,y)$. Note that $\alpha (x,y)$ does
not require knowledge of the normalizing constant because it drops out
of the ratio $\pi (y)/\pi (x)$. A special case arises when the
candidate-generating density is symmetric: $q(x,y)=q(y,x)$ since the
probability of move reduces to $\pi (y)/\pi (x)$. This special case
forms the basis for optimization algorithms such as Simulated
annealing.

The algorithm proceeds as follows: Given an initial value ${x}^{(0)}$, for each $j=1,2,\dots ,N$

- Draw $y$ from $q({x}^{(j)},\cdot )$ and $u$ from $U(0,1)$
- If $u\le \alpha ({x}^{(j)},y)$, set ${x}^{(j+1)}=y$.
- Otherwise, set ${x}^{(j+1)}={x}^{(j)}$.

## References

Chib, S. and E. Greenberg (1995): “Understanding the Metropolis Hastings Algorithm,”

*American Statistical Journal*, 49, 327–335.Hastings, W.K. (1970): “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,”

*Biometrika*, 57, 97–109.Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller (1953): “Equations of State Calculations by Fast Computing Machines,”

*Journal of Chemical Physics*, 21, 1087–1092.Robert, C.P., and G. Casella (2004):

*Monte Carlo Statistical Methods*, Second Edition. New York: Springer.