Metropolis-adjusted Langevin algorithm

From HandWiki
Short description: Markov Chain Monte Carlo algorithm

In computational statistics, the Metropolis-adjusted Langevin algorithm (MALA) or Langevin Monte Carlo (LMC) is a Markov chain Monte Carlo (MCMC) method for obtaining random samples – sequences of random observations – from a probability distribution for which direct sampling is difficult. As the name suggests, MALA uses a combination of two mechanisms to generate the states of a random walk that has the target probability distribution as an invariant measure:

Informally, the Langevin dynamics drive the random walk towards regions of high probability in the manner of a gradient flow, while the Metropolis–Hastings accept/reject mechanism improves the mixing and convergence properties of this random walk. MALA was originally proposed by Julian Besag in 1994,[1] (although the method Smart Monte Carlo was already introduced in 1978 [2]) and its properties were examined in detail by Gareth Roberts together with Richard Tweedie[3] and Jeff Rosenthal.[4] Many variations and refinements have been introduced since then, e.g. the manifold variant of Girolami and Calderhead (2011).[5] The method is equivalent to using the Hamiltonian Monte Carlo (hybrid Monte Carlo) algorithm with only a single discrete time step.[5]

Further details

Let [math]\displaystyle{ \pi }[/math] denote a probability density function on [math]\displaystyle{ \mathbb{R}^{d} }[/math], one from which it is desired to draw an ensemble of independent and identically distributed samples. We consider the overdamped Langevin Itô diffusion

[math]\displaystyle{ \dot{X} = \nabla \log \pi(X) + \sqrt{2} \dot{W} }[/math]

driven by the time derivative of a standard Brownian motion [math]\displaystyle{ W }[/math]. (Note that another commonly-used normalization for this diffusion is

[math]\displaystyle{ \dot{X} = \frac{1}{2} \nabla \log \pi(X) + \dot{W}, }[/math]

which generates the same dynamics.) In the limit as [math]\displaystyle{ t \to \infty }[/math], this probability distribution [math]\displaystyle{ \rho(t) }[/math] of [math]\displaystyle{ X(t) }[/math] approaches a stationary distribution, which is also invariant under the diffusion, which we denote [math]\displaystyle{ \rho_\infty }[/math]. It turns out that, in fact, [math]\displaystyle{ \rho_\infty = \pi }[/math].

Approximate sample paths of the Langevin diffusion can be generated by many discrete-time methods. One of the simplest is the Euler–Maruyama method with a fixed time step [math]\displaystyle{ \tau \gt 0 }[/math]. We set [math]\displaystyle{ X_0 := x_0 }[/math] and then recursively define an approximation [math]\displaystyle{ X_k }[/math] to the true solution [math]\displaystyle{ X(k \tau) }[/math] by

[math]\displaystyle{ X_{k + 1} := X_k + \tau \nabla \log \pi(X_k) + \sqrt{2 \tau} \xi_k, }[/math]

where each [math]\displaystyle{ \xi_{k} }[/math] is an independent draw from a multivariate normal distribution on [math]\displaystyle{ \mathbb{R}^{d} }[/math] with mean 0 and covariance matrix equal to the [math]\displaystyle{ d \times d }[/math] identity matrix. Note that [math]\displaystyle{ X_{k + 1} }[/math] is normally distributed with mean [math]\displaystyle{ X_k + \tau \nabla \log \pi(X_k) }[/math] and covariance equal to [math]\displaystyle{ 2 \tau }[/math] times the [math]\displaystyle{ d \times d }[/math] identity matrix.

In contrast to the Euler–Maruyama method for simulating the Langevin diffusion, which always updates [math]\displaystyle{ X_k }[/math] according to the update rule

[math]\displaystyle{ X_{k + 1} := X_k + \tau \nabla \log \pi(X_k) + \sqrt{2 \tau} \xi_k, }[/math]

MALA incorporates an additional step. We consider the above update rule as defining a proposal [math]\displaystyle{ \tilde{X}_{k + 1} }[/math] for a new state,

[math]\displaystyle{ \tilde{X}_{k + 1} := X_k + \tau \nabla \log \pi(X_k) + \sqrt{2 \tau} \xi_k. }[/math]

This proposal is accepted or rejected according to the Metropolis-Hastings algorithm: set

[math]\displaystyle{ \alpha := \min \left\{ 1 , \frac{\pi(\tilde{X}_{k + 1}) q(X_{k}\mid\tilde{X}_{k + 1})}{\pi({X}_{k}) q(\tilde{X}_{k + 1}\mid X_k)} \right\}, }[/math]

where

[math]\displaystyle{ q(x'\mid x) \propto \exp \left( - \frac{1}{4 \tau} \| x' - x - \tau \nabla \log \pi(x) \|_2^2 \right) }[/math]

is the transition probability density from [math]\displaystyle{ x }[/math] to [math]\displaystyle{ x' }[/math] (note that, in general [math]\displaystyle{ q(x'\mid x) \neq q(x\mid x') }[/math]). Let [math]\displaystyle{ u }[/math] be drawn from the continuous uniform distribution on the interval [math]\displaystyle{ [0, 1] }[/math]. If [math]\displaystyle{ u \leq \alpha }[/math], then the proposal is accepted, and we set [math]\displaystyle{ X_{k + 1} := \tilde{X}_{k + 1} }[/math]; otherwise, the proposal is rejected, and we set [math]\displaystyle{ X_{k + 1} := X_k }[/math].

The combined dynamics of the Langevin diffusion and the Metropolis–Hastings algorithm satisfy the detailed balance conditions necessary for the existence of a unique, invariant, stationary distribution [math]\displaystyle{ \rho_{\infty} = \pi }[/math]. Compared to naive Metropolis–Hastings, MALA has the advantage that it usually proposes moves into regions of higher [math]\displaystyle{ \pi }[/math] probability, which are then more likely to be accepted. On the other hand, when [math]\displaystyle{ \pi }[/math] is strongly anisotropic (i.e. it varies much more quickly in some directions than others), it is necessary to take [math]\displaystyle{ 0 \lt \tau \ll 1 }[/math] in order to properly capture the Langevin dynamics; the use of a positive-definite preconditioning matrix [math]\displaystyle{ A \in \mathbb{R}^{d \times d} }[/math] can help to alleviate this problem, by generating proposals according to

[math]\displaystyle{ \tilde{X}_{k + 1} := X_k + \tau A \nabla \log \pi(X_k) + \sqrt{2 \tau A} \xi_k, }[/math]

so that [math]\displaystyle{ \tilde{X}_{k + 1} }[/math] has mean [math]\displaystyle{ X_k + \tau A \nabla \log \pi(X_k) }[/math] and covariance [math]\displaystyle{ 2 \tau A }[/math].

For limited classes of target distributions, the optimal acceptance rate for this algorithm can be shown to be [math]\displaystyle{ 0.574 }[/math]; if it is discovered to be substantially different in practice, [math]\displaystyle{ \tau }[/math] should be modified accordingly.[4]

References

  1. J. Besag (1994). "Comments on "Representations of knowledge in complex systems" by U. Grenander and MI Miller". Journal of the Royal Statistical Society, Series B 56: 591–592. 
  2. Rossky, P.J.; Doll, J.D.; Friedman, H.L. (1978). "Brownian Dynamics as smart Monte Carlo simulation". J. Chem. Physics 69 (10): 4628. doi:10.1063/1.436415. 
  3. G. O. Roberts and R. L. Tweedie (1996). "Exponential convergence of Langevin distributions and their discrete approximations". Bernoulli 2 (4): 341–363. doi:10.2307/3318418. http://projecteuclid.org/euclid.bj/1178291835. 
  4. 4.0 4.1 G. O. Roberts and J. S. Rosenthal (1998). "Optimal scaling of discrete approximations to Langevin diffusions". Journal of the Royal Statistical Society, Series B 60 (1): 255–268. doi:10.1111/1467-9868.00123. 
  5. 5.0 5.1 M. Girolami and B. Calderhead (2011). "Riemann manifold Langevin and Hamiltonian Monte Carlo methods". Journal of the Royal Statistical Society, Series B 73 (2): 123–214. doi:10.1111/j.1467-9868.2010.00765.x.