Lecture 30 — Sampling and MCMC

From optimization to sampling

Start with a constrained optimization problem:

$$\min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{s.t.} \quad c(\mathbf{x}) = \mathbf{0}, \;\; \mathbf{x} \ge \boldsymbol{\ell}.$$

We are going to encode this entire problem as a single probability density on $\mathbb{R}^n$. The trick is to define

$$P(\mathbf{x}) \;=\; \frac{1}{Z}\, \exp\!\bigl(-f(\mathbf{x})\bigr) \cdot \mathbf{1}\!\left[c(\mathbf{x}) = \mathbf{0}\right] \cdot \mathbf{1}\!\left[\mathbf{x} \ge \boldsymbol{\ell}\right]$$

where $Z$ is whatever normalizing constant makes this integrate to one. Three things happen at once:

The indicator $\mathbf{1}[c(\mathbf{x}) = \mathbf{0}]$ zeros out everything that violates the equality constraint.
The indicator $\mathbf{1}[\mathbf{x} \ge \boldsymbol{\ell}]$ zeros out everything that violates the lower bound.
On the feasible set, $\exp(-f)$ is monotonically decreasing in $f$, so smaller $f$ means larger $P$.

The mode (most likely point) of $P$ is therefore exactly the global minimizer of $f$ subject to the constraints. The hard problem of constrained global minimization has been re-expressed as a problem of finding the peak of a probability density.

Why is this a useful re-expression? Because we have algorithms, namely Markov chain Monte Carlo (MCMC), that can draw samples from $P$ even when we know $P$ only up to the unknown constant $Z$, and even when we cannot evaluate any integral of $P$. We will see that all the sampler needs is the ability to evaluate $f$ and check the constraints.
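One way to package "evaluate $f$ and check the constraints" is a single energy function that returns $+\infty$ at infeasible points, so that $\exp(-E)$ is the unnormalized density. The following is a minimal sketch under stated assumptions, not the lecture's code: the helper name `make_energy` and the tolerance `tol` on the equality constraint are illustrative choices (an exact floating-point test of $c(\mathbf{x}) = \mathbf{0}$ would almost never pass).

```python
import math

def make_energy(f, c=None, lower=None, tol=1e-9):
    """Return E(x) with E = f(x) on the feasible set and +inf elsewhere.

    exp(-E(x)) is then the unnormalized density Z * P(x).
    `tol` is an assumed tolerance for the equality constraint c(x) = 0.
    """
    def energy(x):
        # Lower bound x >= l, checked componentwise.
        if lower is not None and any(xi < li for xi, li in zip(x, lower)):
            return math.inf
        # Equality constraint c(x) = 0, up to the tolerance.
        if c is not None and any(abs(ci) > tol for ci in c(x)):
            return math.inf
        return f(x)
    return energy

# Hypothetical example: minimize x0^2 + x1^2 subject to x0 + x1 = 1, x >= 0.
energy = make_energy(
    f=lambda x: x[0] ** 2 + x[1] ** 2,
    c=lambda x: [x[0] + x[1] - 1.0],
    lower=[0.0, 0.0],
)

feasible = energy([0.5, 0.5])        # finite: the point satisfies both constraints
off_constraint = energy([0.5, 0.6])  # inf: violates x0 + x1 = 1
below_bound = energy([-0.5, 1.5])    # inf: violates x0 >= 0
```

Because `math.exp(-math.inf)` is `0.0`, a sampler that only ever sees `energy` will reject infeasible proposals without any special-case logic.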

A 1-D example

Take $f(x) = \tfrac{1}{8}(x^2 - 4)^2 - 0.3\,x$ on the interval $x \in [-3, 3]$ (a "double well" with a slight tilt). $f$ has two local minima near $x \approx -2$ and $x \approx 2$; the tilt makes the right basin a tiny bit deeper, so the global minimizer is near $x = 2$.

Now plot $P(x) \propto \exp(-f(x)) \cdot \mathbf{1}[-3 \le x \le 3]$ next to $f$:

Two things to read off the picture:

$P$ is zero outside $[-3, 3]$ — the indicator wiped out everything else.
$P$ has its tallest peak at the global minimum, a smaller peak at the local minimum, and is low and nearly flat near the barrier between them (the local maximum of $f$ near $x = 0$).

If we had a way to sample from $P$, then most of our samples would land near the global min, fewer near the local min, and very few elsewhere. That is the connection between sampling and optimization. Designing a way to sample from $P$ is the job of Metropolis-Hastings.
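The claims about this example can be checked numerically without any sampling. The sketch below evaluates the unnormalized $P$ on a grid, estimates $Z$ with a Riemann sum, and compares the location of the peak of $P$ with the grid minimizer of $f$. The grid size and the variable names are arbitrary choices made for illustration.

```python
import math

def f(x):
    return (x * x - 4.0) ** 2 / 8.0 - 0.3 * x

def unnormalized_p(x, lo=-3.0, hi=3.0):
    # exp(-f) times the indicator of [lo, hi]
    return math.exp(-f(x)) if lo <= x <= hi else 0.0

# Uniform grid on [-4, 4], deliberately wider than the feasible interval.
n = 8001
h = 8.0 / (n - 1)
xs = [-4.0 + i * h for i in range(n)]
ps = [unnormalized_p(x) for x in xs]

Z = sum(ps) * h                                    # Riemann-sum estimate of Z
x_mode = xs[max(range(n), key=ps.__getitem__)]     # grid point where P peaks
x_min = min((x for x in xs if -3.0 <= x <= 3.0), key=f)  # grid minimizer of f
mass_right = sum(p for x, p in zip(xs, ps) if x > 0) * h / Z
```

`x_mode` and `x_min` should land on the same grid point, slightly to the right of $x = 2$, and `mass_right` is the fraction of samples an ideal sampler would place in the right-hand basin.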

The Metropolis-Hastings algorithm

We construct a Markov chain whose stationary distribution is $P$. Given the current state $\mathbf{x}$:

Propose $\mathbf{x}' = \mathbf{x} + \tau\, \boldsymbol{\eta}$ where $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, I)$.
If $\mathbf{x}'$ violates a constraint ($c(\mathbf{x}') \ne \mathbf{0}$ or $\mathbf{x}' \not\ge \boldsymbol{\ell}$), reject — the indicators make $P(\mathbf{x}') = 0$. Otherwise compute $$a = \min\!\left(1,\; \frac{P(\mathbf{x}')}{P(\mathbf{x})}\right) = \min\!\left(1,\; \exp\!\bigl(-\bigl(f(\mathbf{x}') - f(\mathbf{x})\bigr)\bigr)\right).$$
Accept (set $\mathbf{x} \leftarrow \mathbf{x}'$) with probability $a$; otherwise stay put.

Notice that the unknown $Z$ cancels in the ratio $P(\mathbf{x}')/P(\mathbf{x})$ — we only ever need to evaluate $f$ at two points. The single tuning knob is $\tau$, the proposal step size. The asymmetry between "downhill is always accepted" and "uphill is sometimes accepted" is what turns this into sampling instead of greedy descent.

This is one of the few algorithms where staying put is normal. In many iterations the chain rejects the proposal and records the current point again. Don't dedupe: the repetition is the probability.
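To make the acceptance rule concrete, here is a small worked example on the 1-D double well from the previous section. It is a sketch for illustration: the function name `accept_prob` and the value of `Z` are made up, and `Z` is there only to show that it cancels.

```python
import math

def f(x):
    return (x * x - 4.0) ** 2 / 8.0 - 0.3 * x

def accept_prob(x, x_new, lo=-3.0, hi=3.0):
    """Metropolis acceptance probability for the move x -> x_new."""
    if not (lo <= x_new <= hi):
        return 0.0                      # the indicator makes P(x_new) = 0
    df = f(x_new) - f(x)
    return 1.0 if df <= 0 else math.exp(-df)

a_downhill = accept_prob(0.0, 2.0)    # f drops from 2.0 to -0.6: always accepted
a_uphill = accept_prob(2.0, 0.0)      # f rises by 2.6: accepted with prob exp(-2.6)
a_infeasible = accept_prob(2.0, 3.5)  # leaves [-3, 3]: never accepted

# The same uphill ratio from normalized densities with a made-up Z:
Z = 123.4
ratio_with_Z = (math.exp(-f(0.0)) / Z) / (math.exp(-f(2.0)) / Z)
```

`ratio_with_Z` agrees with `a_uphill` for any positive `Z`, which is why the sampler never needs to know the normalizing constant.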

The code in three languages

The full sampler is about 15 lines. Here it is in Julia, Python, and JavaScript, in that order:

function mcmc!(f, x0, tau, hist)
    niter = size(hist, 2) - 1
    n = length(x0)

    x    = copy(x0)
    curf = f(x)
    bestx, bestf = copy(x), curf
    hist[:, 1] = [curf; x]

    for iter in 1:niter
        xn    = x .+ tau .* randn(n)
        nextf = f(xn)
        a     = exp(-(nextf - curf))    # T = 1 here
        if a > 1 || rand() <= a
            x, curf = xn, nextf
        end
        if curf < bestf
            bestx, bestf = copy(x), curf
        end
        hist[:, iter+1] = [curf; x]
    end
    return bestf, bestx
end

import numpy as np

def mcmc(f, x0, tau, niter):
    x    = np.asarray(x0, dtype=float).copy()
    curf = f(x)
    bestx, bestf = x.copy(), curf
    hist = np.zeros((1 + len(x0), niter + 1))
    hist[:, 0] = np.r_[curf, x]

    for it in range(1, niter + 1):
        xn    = x + tau * np.random.randn(len(x0))
        nextf = f(xn)
        a     = np.exp(-(nextf - curf))   # T = 1 here
        if a > 1 or np.random.rand() <= a:
            x, curf = xn, nextf
        if curf < bestf:
            bestx, bestf = x.copy(), curf
        hist[:, it] = np.r_[curf, x]

    return bestf, bestx, hist

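// JavaScript has no built-in Gaussian sampler; this Box-Muller helper is
// added so the translation runs on its own.
function randn() {
  const u = 1 - Math.random();   // in (0, 1], so Math.log(u) is finite
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function mcmc(f, x0, tau, niter) {
  let x    = x0.slice();
  let curf = f(x);
  let bestx = x.slice(), bestf = curf;
  const hist = [];
  hist.push([curf, ...x]);

  for (let it = 1; it <= niter; it++) {
    const xn = x.map(v => v + tau * randn());   // random-walk proposal
    const nextf = f(xn);
    const a = Math.exp(-(nextf - curf));   // T = 1 here
    if (a > 1 || Math.random() <= a) {
      x = xn; curf = nextf;
    }
    if (curf < bestf) { bestx = x.slice(); bestf = curf; }
    hist.push([curf, ...x]);   // recorded even when the proposal is rejected
  }
  return { bestf, bestx, hist };
}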

The Julia version is the original from this lecture's notebook; the others are direct translations. Notice that the algorithm does not depend on dimension — only the proposal and the function evaluation do.

Reading the diagnostics

The trace plot $f(\mathbf{x}_t)$ over iterations and the autocorrelation plot tell you when the chain has "found" the basin and started sampling. We compute the empirical autocorrelation of one coordinate $x_1$:

$$\hat r_k = \frac{\sum_{t=1}^{T-k} (x_{1,t} - \bar x_1)(x_{1,t+k} - \bar x_1)}{\sum_{t=1}^{T} (x_{1,t} - \bar x_1)^2}.$$
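The formula above translates almost line for line into code. This is a plain-Python sketch (the function name `autocorr` is an illustrative choice); it assumes the series is not constant, since a chain that never moves makes the denominator zero.

```python
def autocorr(x, max_lag):
    """Empirical autocorrelation r_k of a scalar series for k = 0..max_lag."""
    T = len(x)
    mean = sum(x) / T
    d = [v - mean for v in x]              # deviations from the sample mean
    denom = sum(v * v for v in d)          # sum over all T terms
    return [
        sum(d[t] * d[t + k] for t in range(T - k)) / denom   # T - k terms
        for k in range(max_lag + 1)
    ]

# Tiny check on an alternating series: strongly negative lag-1 correlation.
r = autocorr([1.0, -1.0, 1.0, -1.0], max_lag=2)   # [1.0, -0.75, 0.5]
```

In practice `x` would be one row of the sampler's history (for instance the $x_1$ coordinate), and the quantity of interest is how many lags it takes for `r` to fall to around zero.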

Three regimes:

$\tau$	What happens	Trace	Autocorrelation
too small	almost every step accepted, but you barely move	smooth, drifting	$\hat r_k$ stays near 1 for many lags
well-tuned	~25% acceptance; covers the basin	noisy, stationary-looking	$\hat r_k$ decays in a few dozen lags
too large	almost every step rejected; chain stuck	flat plateaus with rare jumps	$\hat r_k$ stays near 1 for many lags

The "you are sampling the basin" sign is autocorrelation that decays to (around) zero within a small fraction of your run length and a trace plot whose mean is roughly constant.

Folklore acceptance rate: Roberts, Gelman, and Gilks showed that for Gaussian targets, the random-walk MH proposal that maximizes effective sample size has acceptance rate $\approx 0.234$ in high dimensions. In 1D it is closer to $0.44$. "Tune $\tau$ until you accept about a quarter of the time" is a surprisingly portable heuristic.
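The three regimes are easy to reproduce by measuring the acceptance rate directly. The sketch below runs 1-D random-walk Metropolis on a standard normal target for three step sizes. The helper name `acceptance_rate`, the target, the seed, and the particular $\tau$ values are all assumptions chosen for illustration; $\tau = 2.4$ is a commonly quoted scale for a unit-variance 1-D Gaussian.

```python
import math
import random

def acceptance_rate(f, x0, tau, niter, seed=0):
    """Fraction of accepted proposals for 1-D random-walk Metropolis."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    accepted = 0
    for _ in range(niter):
        x_new = x + tau * rng.gauss(0.0, 1.0)
        f_new = f(x_new)
        df = f_new - fx
        # Downhill moves are always accepted; uphill with probability exp(-df).
        if df <= 0 or rng.random() < math.exp(-df):
            x, fx = x_new, f_new
            accepted += 1
    return accepted / niter

def gaussian_energy(x):
    return 0.5 * x * x   # standard normal target, up to a constant

rates = {tau: acceptance_rate(gaussian_energy, 0.0, tau, 20000)
         for tau in (0.05, 2.4, 50.0)}
```

The tiny step should be accepted almost always, the huge step almost never, and the middle one somewhere in between. A simple tuning loop would adjust `tau` up when the rate is too high and down when it is too low.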

Ideas to do better than vanilla MH

The random-walk proposal is a starting point, not the destination. Things you can try:

Use the gradient (MALA / Langevin). Propose $\mathbf{x}' = \mathbf{x} - \tfrac{\tau^2}{2}\nabla f(\mathbf{x}) + \tau\, \boldsymbol{\eta}.$ The drift biases the proposal toward lower $f$. The acceptance rule needs an extra correction for the asymmetric proposal density. Try the "use gradient" toggle in the demo.
Hamiltonian Monte Carlo (HMC). Augment with momentum and integrate Hamilton's equations. Long, informed jumps. The basis of Stan and PyMC.
Adaptive proposals. Estimate the empirical covariance of the chain so far and use it to scale the proposal. See Roberts and Rosenthal, "Examples of adaptive MCMC," in the reading list at the end.
Block / Gibbs. Update one coordinate at a time, conditional on the others. Cheap when the conditional distributions are easy.
Parallel tempering, ensemble samplers. Run multiple chains at different temperatures (or in different "stretch" geometries) and let them swap. See the reading list at the end.
Change the acceptance rule toward "better points only." If you gradually strengthen the preference for downhill moves until uphill moves are almost never accepted, that is simulated annealing.
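Of the ideas in this list, the gradient-based (MALA) proposal is the smallest change to the code, so it is worth sketching. The version below is a 1-D illustration under assumptions: the function names are made up, the target is a standard normal chosen because its mean and variance are known, and the "extra correction" is written out as the ratio of proposal densities $q(\mathbf{x} \mid \mathbf{x}') / q(\mathbf{x}' \mid \mathbf{x})$.

```python
import math
import random

def mala_log_accept(f, grad, x, x_new, tau):
    """log of the MALA acceptance ratio for the move x -> x_new (1-D)."""
    def log_q(to, frm):
        # log density, up to a constant, of proposing `to` from `frm`
        mean = frm - 0.5 * tau * tau * grad(frm)
        return -((to - mean) ** 2) / (2.0 * tau * tau)
    # Target ratio plus the correction for the asymmetric proposal.
    return (f(x) - f(x_new)) + log_q(x, x_new) - log_q(x_new, x)

def mala(f, grad, x0, tau, niter, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(niter):
        # Drift toward lower f, then add Gaussian noise.
        x_new = x - 0.5 * tau * tau * grad(x) + tau * rng.gauss(0.0, 1.0)
        log_a = mala_log_accept(f, grad, x, x_new, tau)
        if log_a >= 0 or rng.random() < math.exp(log_a):
            x = x_new
        samples.append(x)   # repeats are kept on rejection
    return samples

def f(x):
    return 0.5 * x * x      # assumed test target: standard normal

def grad(x):
    return x

samples = mala(f, grad, x0=3.0, tau=1.0, niter=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

If the correction is implemented correctly, `mean` and `var` should come out close to 0 and 1. Dropping the two `log_q` terms gives an unadjusted scheme whose stationary distribution is no longer exactly $P$.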

Simulated annealing — the same algorithm, but cooling

Replace the constant $T = 1$ with a schedule $T_t \to 0$. The acceptance ratio becomes

$$a_t = \min\!\left(1, \exp\!\left(-\tfrac{f(\mathbf{x}') - f(\mathbf{x})}{T_t}\right)\right).$$

At high $T$, the chain freely climbs uphill and explores. As $T \to 0$, it accepts only improvements — pure greedy descent. Slow enough cooling provably finds the global optimum, but "slow enough" is exponential in $n$, so this is an existence proof, not an algorithm.

Try the "anneal" schedule in the demo. Notice the trajectory tightens as iterations proceed.

Same code, different schedule. The Julia anneal! function in the original notebook differs from mcmc! by one line: the temperature in the acceptance ratio. That is the entire conceptual content of simulated annealing.
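The notebook's `anneal!` is not reproduced here, so the following is only a sketch of what the one-line change might look like in Python. The geometric cooling schedule, the parameter names `T0` and `T_end`, and the 1-D double-well test function are assumptions made for illustration.

```python
import math
import random

def anneal(f, x0, tau, niter, T0=2.0, T_end=1e-3, seed=0):
    """1-D simulated annealing with an assumed geometric cooling schedule."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    cool = (T_end / T0) ** (1.0 / max(niter - 1, 1))   # per-step factor
    T = T0
    for _ in range(niter):
        x_new = x + tau * rng.gauss(0.0, 1.0)
        f_new = f(x_new)
        df = f_new - fx
        # The only change from plain Metropolis: df is divided by T.
        if df <= 0 or rng.random() < math.exp(-df / T):
            x, fx = x_new, f_new
        if fx < best_f:
            best_x, best_f = x, fx
        T *= cool
    return best_f, best_x

def double_well(x):
    # The tilted double well, with +inf outside [-3, 3] as the constraint.
    if not (-3.0 <= x <= 3.0):
        return math.inf
    return (x * x - 4.0) ** 2 / 8.0 - 0.3 * x

# Start in the shallower left basin on purpose.
best_f, best_x = anneal(double_well, x0=-2.0, tau=0.5, niter=20000)
```

Starting in the wrong basin, the early high-temperature phase lets the chain cross the barrier, and the best point recorded should end up in the right-hand well. The `df <= 0` shortcut also keeps `math.exp` from overflowing when $T$ is very small.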

Where to read next

Abraham Flaxman's "MCMC in Python" series (~2010, blog: healthyalgorithms.com). A practical, hands-on tour of the MCMC zoo: random walk, slice sampling, adaptive Metropolis, parallel tempering, and PyMC. The code style is unfussy and the diagnostics are honest about when chains fail.
Goodman & Weare, "Ensemble samplers with affine invariance" (2010). Run an ensemble of "walkers" that propose moves based on the positions of other walkers. The proposal is invariant under affine transforms, so it does not need scale tuning per coordinate. This is the algorithm in the emcee package, ubiquitous in astrophysics ("the MCMC hammer," Foreman-Mackey et al. 2013).
Roberts & Rosenthal, "Examples of adaptive MCMC" (2009). When and how it is safe to tune $\tau$ on the fly without breaking stationarity.
Neal, "MCMC using Hamiltonian dynamics" (2011, Handbook of MCMC). The reference for HMC.