Sampling and MCMC

From "minimize $f$" to "draw samples from $\exp(-f)$", and what we get from doing so.

From optimization to sampling

Start with a constrained optimization problem:

$$\min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{s.t.} \quad c(\mathbf{x}) = \mathbf{0}, \;\; \mathbf{x} \ge \boldsymbol{\ell}.$$

We are going to encode this entire problem as a single probability density on $\mathbb{R}^n$. The trick is to define

$$P(\mathbf{x}) \;=\; \frac{1}{Z}\, \exp\!\bigl(-f(\mathbf{x})\bigr) \cdot \mathbf{1}\!\left[c(\mathbf{x}) = \mathbf{0}\right] \cdot \mathbf{1}\!\left[\mathbf{x} \ge \boldsymbol{\ell}\right]$$

where $Z$ is whatever normalizing constant makes this integrate to one. Three things happen at once:

The mode (most likely point) of $P$ is therefore exactly the global minimizer of $f$ subject to the constraints. The hard problem of constrained global minimization has been re-expressed as a problem of finding the peak of a probability density.

Why is this a useful re-expression? Because we have algorithms — Markov chain Monte Carlo — that can draw samples from $P$ even when we only know $P$ up to the unknown constant $Z$, and even when we cannot evaluate any integral of $P$. We will see that all those samples need is the ability to evaluate $f$ and check the constraints.

A 1-D example

Take $f(x) = \tfrac{1}{8}(x^2 - 4)^2 - 0.3\,x$ on the interval $x \in [-3, 3]$ (a "double well" with a slight tilt). $f$ has two local minima near $x \approx -2$ and $x \approx 2$; the tilt makes the right basin a tiny bit deeper, so the global minimizer is near $x = 2$.

Now plot $P(x) \propto \exp(-f(x)) \cdot \mathbf{1}[-3 \le x \le 3]$ next to $f$:

Two things to read off the picture:

If we had a way to sample from $P$, then most of our samples would land near the global min, fewer near the local min, and very few elsewhere. That is the connection between sampling and optimization. Designing a way to sample from $P$ is the job of Metropolis-Hastings.

The Metropolis-Hastings algorithm

We construct a Markov chain whose stationary distribution is $P$. Given the current state $\mathbf{x}$:

  1. Propose $\mathbf{x}' = \mathbf{x} + \tau\, \boldsymbol{\eta}$ where $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, I)$.
  2. If $\mathbf{x}'$ violates a constraint ($c(\mathbf{x}') \ne \mathbf{0}$ or $\mathbf{x}' \not\ge \boldsymbol{\ell}$), reject — the indicators make $P(\mathbf{x}') = 0$. Otherwise compute $$a = \min\!\left(1,\; \frac{P(\mathbf{x}')}{P(\mathbf{x})}\right) = \min\!\left(1,\; \exp\!\bigl(-\bigl(f(\mathbf{x}') - f(\mathbf{x})\bigr)\bigr)\right).$$
  3. Accept (set $\mathbf{x} \leftarrow \mathbf{x}'$) with probability $a$; otherwise stay put.

Notice that the unknown $Z$ cancels in the ratio $P(\mathbf{x}')/P(\mathbf{x})$ — we only ever need to evaluate $f$ at two points. The single tuning knob is $\tau$, the proposal step size. The asymmetry between "downhill is always accepted" and "uphill is sometimes accepted" is what turns this into sampling instead of greedy descent.

This is one of the few algorithms where staying put is normal. Many iterations the chain rejects and writes the same point twice. Don't dedupe — the repetition is the probability.

The code in three languages

The full sampler is about 15 lines. Side by side:

function mcmc!(f, x0, tau, hist)
    niter = size(hist, 2) - 1
    n = length(x0)

    x    = copy(x0)
    curf = f(x)
    bestx, bestf = copy(x), curf
    hist[:, 1] = [curf; x]

    for iter in 1:niter
        xn    = x .+ tau .* randn(n)
        nextf = f(xn)
        a     = exp(-(nextf - curf))    # T = 1 here
        if a > 1 || rand() <= a
            x, curf = xn, nextf
        end
        if curf < bestf
            bestx, bestf = copy(x), curf
        end
        hist[:, iter+1] = [curf; x]
    end
    return bestf, bestx
end

The Julia version is the original from this lecture's notebook; the others are direct translations. Notice that the algorithm does not depend on dimension — only the proposal and the function evaluation do.

Metropolis-Hastings on Rosenbrock

Watch the chain explore the Rosenbrock banana. Drag $\tau$ to see what happens when the proposal is too small (sticky) or too large (rejected). Use the iteration slider to scrub through the trajectory.

Reading the diagnostics

The trace plot $f(\mathbf{x}_t)$ over iterations and the autocorrelation plot tell you when the chain has "found" the basin and started sampling. We compute the empirical autocorrelation of one coordinate $x_1$:

$$\hat r_k = \frac{\sum_{t=1}^{T-k} (x_{1,t} - \bar x_1)(x_{1,t+k} - \bar x_1)}{\sum_{t=1}^{T} (x_{1,t} - \bar x_1)^2}.$$

Three regimes:

$\tau$What happensTraceAutocorrelation
too small almost every step accepted, but you barely move smooth, drifting $\hat r_k$ stays near 1 for many lags
well-tuned ~25% acceptance; covers the basin noisy, stationary-looking $\hat r_k$ decays in a few dozen lags
too large almost every step rejected; chain stuck flat plateaus with rare jumps $\hat r_k$ stays near 1 for many lags

The "you are sampling the basin" sign is autocorrelation that decays to (around) zero within a small fraction of your run length and a trace plot whose mean is roughly constant.

Folklore acceptance rate: Roberts, Gelman, and Wilks showed that for Gaussian targets, the random-walk MH proposal that maximizes effective sample size has acceptance rate $\approx 0.234$ in high dimensions. In 1D it is closer to $0.44$. "Tune $\tau$ until you accept about a quarter of the time" is a surprisingly portable heuristic.

Ideas to do better than vanilla MH

The random-walk proposal is a starting point, not the destination. Things you can try:

Simulated annealing — the same algorithm, but cooling

Replace the constant $T = 1$ with a schedule $T_t \to 0$. The acceptance ratio becomes

$$a_t = \min\!\left(1, \exp\!\left(-\tfrac{f(\mathbf{x}') - f(\mathbf{x})}{T_t}\right)\right).$$

At high $T$, the chain freely climbs uphill and explores. As $T \to 0$, it accepts only improvements — pure greedy descent. Slow enough cooling provably finds the global optimum, but "slow enough" is exponential in $n$, so this is an existence proof, not an algorithm.

Try the "anneal" schedule in the demo. Notice the trajectory tightens as iterations proceed.

Same code, different schedule. The Julia anneal! function in the original notebook differs from mcmc! by one line: the temperature in the acceptance ratio. That is the entire conceptual content of simulated annealing.

Where to read next