From "minimize $f$" to "draw samples from $\exp(-f)$", and what we get from doing so.
Start with a constrained optimization problem:
$$\min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{s.t.} \quad c(\mathbf{x}) = \mathbf{0}, \;\; \mathbf{x} \ge \boldsymbol{\ell}.$$We are going to encode this entire problem as a single probability density on $\mathbb{R}^n$. The trick is to define
$$P(\mathbf{x}) \;=\; \frac{1}{Z}\, \exp\!\bigl(-f(\mathbf{x})\bigr) \cdot \mathbf{1}\!\left[c(\mathbf{x}) = \mathbf{0}\right] \cdot \mathbf{1}\!\left[\mathbf{x} \ge \boldsymbol{\ell}\right]$$where $Z$ is whatever normalizing constant makes this integrate to one. Three things happen at once:
The mode (most likely point) of $P$ is therefore exactly the global minimizer of $f$ subject to the constraints. The hard problem of constrained global minimization has been re-expressed as a problem of finding the peak of a probability density.
Take $f(x) = \tfrac{1}{8}(x^2 - 4)^2 - 0.3\,x$ on the interval $x \in [-3, 3]$ (a "double well" with a slight tilt). $f$ has two local minima near $x \approx -2$ and $x \approx 2$; the tilt makes the right basin a tiny bit deeper, so the global minimizer is near $x = 2$.
Now plot $P(x) \propto \exp(-f(x)) \cdot \mathbf{1}[-3 \le x \le 3]$ next to $f$:
Two things to read off the picture:
If we had a way to sample from $P$, then most of our samples would land near the global min, fewer near the local min, and very few elsewhere. That is the connection between sampling and optimization. Designing a way to sample from $P$ is the job of Metropolis-Hastings.
We construct a Markov chain whose stationary distribution is $P$. Given the current state $\mathbf{x}$:
Notice that the unknown $Z$ cancels in the ratio $P(\mathbf{x}')/P(\mathbf{x})$ — we only ever need to evaluate $f$ at two points. The single tuning knob is $\tau$, the proposal step size. The asymmetry between "downhill is always accepted" and "uphill is sometimes accepted" is what turns this into sampling instead of greedy descent.
The full sampler is about 15 lines. Side by side:
function mcmc!(f, x0, tau, hist)
niter = size(hist, 2) - 1
n = length(x0)
x = copy(x0)
curf = f(x)
bestx, bestf = copy(x), curf
hist[:, 1] = [curf; x]
for iter in 1:niter
xn = x .+ tau .* randn(n)
nextf = f(xn)
a = exp(-(nextf - curf)) # T = 1 here
if a > 1 || rand() <= a
x, curf = xn, nextf
end
if curf < bestf
bestx, bestf = copy(x), curf
end
hist[:, iter+1] = [curf; x]
end
return bestf, bestx
end
The Julia version is the original from this lecture's notebook; the others are direct translations. Notice that the algorithm does not depend on dimension — only the proposal and the function evaluation do.
The trace plot $f(\mathbf{x}_t)$ over iterations and the autocorrelation plot tell you when the chain has "found" the basin and started sampling. We compute the empirical autocorrelation of one coordinate $x_1$:
$$\hat r_k = \frac{\sum_{t=1}^{T-k} (x_{1,t} - \bar x_1)(x_{1,t+k} - \bar x_1)}{\sum_{t=1}^{T} (x_{1,t} - \bar x_1)^2}.$$Three regimes:
| $\tau$ | What happens | Trace | Autocorrelation |
|---|---|---|---|
| too small | almost every step accepted, but you barely move | smooth, drifting | $\hat r_k$ stays near 1 for many lags |
| well-tuned | ~25% acceptance; covers the basin | noisy, stationary-looking | $\hat r_k$ decays in a few dozen lags |
| too large | almost every step rejected; chain stuck | flat plateaus with rare jumps | $\hat r_k$ stays near 1 for many lags |
The "you are sampling the basin" sign is autocorrelation that decays to (around) zero within a small fraction of your run length and a trace plot whose mean is roughly constant.
The random-walk proposal is a starting point, not the destination. Things you can try:
Replace the constant $T = 1$ with a schedule $T_t \to 0$. The acceptance ratio becomes
$$a_t = \min\!\left(1, \exp\!\left(-\tfrac{f(\mathbf{x}') - f(\mathbf{x})}{T_t}\right)\right).$$At high $T$, the chain freely climbs uphill and explores. As $T \to 0$, it accepts only improvements — pure greedy descent. Slow enough cooling provably finds the global optimum, but "slow enough" is exponential in $n$, so this is an existence proof, not an algorithm.
Try the "anneal" schedule in the demo. Notice the trajectory tightens as iterations proceed.
anneal! function in the original notebook differs from mcmc! by one line: the temperature in the acceptance ratio. That is the entire conceptual content of simulated annealing.
healthyalgorithms.com). A practical, hands-on tour of the MCMC zoo: random walk, slice sampling, adaptive Metropolis, parallel tempering, and PyMC. The code style is unfussy and the diagnostics are honest about when chains fail.emcee package, ubiquitous in astrophysics ("the MCMC hammer," Foreman-Mackey et al. 2013).