Penalty & Augmented Lagrangian

Converting constrained problems to unconstrained ones — and doing it well.

The penalty method

Given the equality-constrained problem:

$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{c}(\mathbf{x}) = \mathbf{0}$$

We replace the constraints with a penalty term:

$$\min_{\mathbf{x}}\; f(\mathbf{x}) + \frac{\mu}{2} \sum_i c_i(\mathbf{x})^2 \quad \longleftrightarrow \quad \min_{\mathbf{x}}\; f(\mathbf{x}) + \frac{\mu}{2}\,\mathbf{c}(\mathbf{x})^T\mathbf{c}(\mathbf{x})$$

Large $\mu$ penalizes constraint violation heavily, pushing the solution toward feasibility.

The penalty algorithm

Let {τk} → 0, {μk} → ∞
While ||c(xk)|| ≥ tol:
    Solve min_x f(x) + μk/2 · c(x)Tc(x) // to gradient norm ≤ τk
    Set xk+1 = solution
    Increase μk, decrease τk
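As an illustrative sketch of this loop (not from the text), the following pure-NumPy code applies Newton's method to each penalized subproblem of the example used below, $\min\; x + y$ s.t. $x^2 + y^2 - 1 = 0$, warm-starting as $\mu$ grows. The function name and the $\mu$ schedule are my own choices:

```python
import numpy as np

def penalty_solve(mu_schedule, x0, newton_iters=30):
    """Quadratic-penalty method for min x+y s.t. x^2 + y^2 - 1 = 0.

    Minimizes f(x) + mu/2 * c(x)^2 for an increasing schedule of mu,
    warm-starting each subproblem from the previous solution."""
    x = np.asarray(x0, dtype=float)
    for mu in mu_schedule:
        for _ in range(newton_iters):
            c = x @ x - 1.0                               # constraint value
            grad = np.array([1.0, 1.0]) + 2.0 * mu * c * x
            # Hessian of f + (mu/2) c^2 for this particular problem
            hess = 2.0 * mu * c * np.eye(2) + 4.0 * mu * np.outer(x, x)
            x = x - np.linalg.solve(hess, grad)
    return x

x = penalty_solve([0.5, 2.0, 10.0, 50.0], x0=(-0.9, -0.9))
# x approaches the true solution (-1/sqrt(2), -1/sqrt(2)) as mu grows,
# but for any finite mu it sits slightly outside the feasible circle.
```

Note that each subproblem's minimizer is infeasible; only the limit as $\mu \to \infty$ lands on the constraint.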

The effect of $\mu$: 3D landscape

Example: $\min\; x + y$ subject to $x^2 + y^2 - 1 = 0$.

[Surface plots: $\mu = 1/2$ (easy to optimize) vs. $\mu = 50$ (ill-conditioned)]

With small $\mu$, the surface is a gentle bowl — easy to optimize but the minimum is far from the constraint circle. With large $\mu$, the surface forms a steep-walled trough along the circle — the minimum is close to the constraint, but the landscape is highly ill-conditioned.

Contour view

[Contour plots: $\mu = 1/2$ vs. $\mu = 50$]

The dashed circle is the constraint $x^2 + y^2 = 1$. The star marks the true constrained solution at $(-1/\sqrt{2}, -1/\sqrt{2})$.


Convergence theorems

Theorem 17.1 (Paraphrased)

If we use the global minimizer of each penalized subproblem, then as $\mu_k \to \infty$ the solutions converge to a solution of the constrained problem.

Theorem 17.2 (Paraphrased)

If we approximately minimize each subproblem (to gradient norm $\|\mathbf{g}(\mathbf{x}_k)\| \le \tau_k$ with $\tau_k \to 0$), then a limit point of the sequence is either:

  • An infeasible stationary point of $\|\mathbf{c}(\mathbf{x})\|^2$ (stuck trying to satisfy constraints), or
  • A KKT point of the original problem

Note: The convergence guarantee concerns limit points of the sequence, not any particular iterate. Also, the penalty only drives $\|\mathbf{c}(\mathbf{x})\|^2$ toward a stationary point, which is a weaker condition than $\mathbf{c}(\mathbf{x}) = \mathbf{0}$: the squared constraint norm can have spurious stationary points that are infeasible.

Weaknesses of penalty methods

Problem 1: Ill-conditioning. As $\mu_k \to \infty$, the Hessian of the penalized objective becomes increasingly ill-conditioned, making the subproblems harder to solve.
Problem 2: Not all constraints are equal!

The penalty method applies the same $\mu$ to every constraint. But some constraints interact more strongly with the objective than others.

The Hanging Chain Problem

A chain of nodes hangs under gravity, fixed at two endpoints. Each link should have length 1 (constraints). We minimize total height (sum of $y$-coordinates) plus penalty $\frac{\mu}{2}\sum_i (\|\mathbf{p}_{i+1} - \mathbf{p}_i\|^2 - 1)^2$.

Watch how the links near the fixed endpoints stretch more — the penalty treats all constraints equally, but these constraints "want" to violate more because they bear the most tension.

[Plot: link length deviations along the chain]
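A minimal simulation of this effect, using a hypothetical small instance (7 links spanning a width of 5, $\mu = 10$, plain gradient descent; the function name and all parameters are illustrative, not from the text):

```python
import numpy as np

def hang_chain(n_links=7, span=5.0, mu=10.0, step=3e-3, iters=40000):
    """Penalty formulation of the hanging chain: minimize sum of node
    heights plus (mu/2) * sum((||p_{i+1} - p_i||^2 - 1)^2) by plain
    gradient descent. Endpoints are pinned at (0, 0) and (span, 0)."""
    n = n_links + 1
    p = np.column_stack([np.linspace(0.0, span, n), np.zeros(n)])
    for _ in range(iters):
        d = p[1:] - p[:-1]                    # link vectors
        c = (d * d).sum(axis=1) - 1.0         # squared-length violations
        g = np.zeros_like(p)
        g[:, 1] = 1.0                         # gravity: d/dy of sum(y)
        pen = 2.0 * mu * c[:, None] * d       # d/dp of (mu/2) c_i^2
        g[1:] += pen
        g[:-1] -= pen
        g[0] = g[-1] = 0.0                    # endpoints stay fixed
        p -= step * g
    lengths = np.linalg.norm(p[1:] - p[:-1], axis=1)
    return p, lengths - 1.0                   # per-link deviation from 1
```

At equilibrium every link is stretched past its rest length, and the end links (which carry the most tension) deviate the most, exactly the unequal violation the single global $\mu$ cannot prevent.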

Augmented Lagrangian methods

The fix: instead of just penalizing, also estimate Lagrange multipliers $\lambda$ for each constraint:

$$\mathcal{L}(\mathbf{x}; \lambda, \mu) = f(\mathbf{x}) - \lambda^T\mathbf{c}(\mathbf{x}) + \frac{\mu}{2}\|\mathbf{c}(\mathbf{x})\|^2$$

If we minimize in $\mathbf{x}$ alone, the stationarity condition is:

$$\nabla_\mathbf{x} \mathcal{L} = \mathbf{g}_f(\mathbf{x}) - \mathbf{J}_c(\mathbf{x})^T(\lambda - \mu\,\mathbf{c}(\mathbf{x})) = 0$$

Compare with the KKT condition for the original problem:

$$\mathbf{g}_f(\mathbf{x}^*) - \mathbf{J}_c(\mathbf{x}^*)^T\lambda^* = 0$$

At a solution where $\mathbf{c}(\mathbf{x}^*) = 0$, these match if $\lambda = \lambda^*$. This suggests the multiplier update:

$$\lambda_{k+1} = \lambda_k - \mu_k\,\mathbf{c}(\mathbf{x}_k)$$

The augmented Lagrangian algorithm

Initialize λ0, μ0, x0
Repeat:
    Solve min_x ℒ(x; λk, μk) // to tolerance τk, starting from xk
    Set xk+1 = solution
    If ||c(xk+1)|| is small: stop!
    Else:
        λk+1 = λk − μk c(xk+1)
        μk+1 ≥ μk // optionally increase
Key advantage: With good multiplier estimates, convergence can be achieved for a finite value of $\mu$ — unlike the pure penalty method which needs $\mu \to \infty$. This bounds the ill-conditioning.
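A sketch of this algorithm on the same circle example as before ($\min\; x + y$ s.t. $x^2 + y^2 - 1 = 0$, whose true multiplier works out to $\lambda^* = -1/\sqrt{2}$); the function name and parameter values are my own:

```python
import numpy as np

def auglag_solve(mu=10.0, outer=20, inner=30):
    """Augmented Lagrangian for min x+y s.t. c(x) = x^2 + y^2 - 1 = 0,
    with a FIXED penalty parameter: the multiplier update does the work."""
    x = np.array([-0.9, -0.9])
    lam = 0.0
    for _ in range(outer):
        for _ in range(inner):                 # Newton on x -> L(x; lam, mu)
            c = x @ x - 1.0
            a = mu * c - lam                   # grad L = [1,1] + 2*a*x
            grad = np.array([1.0, 1.0]) + 2.0 * a * x
            hess = 2.0 * a * np.eye(2) + 4.0 * mu * np.outer(x, x)
            x = x - np.linalg.solve(hess, grad)
        lam = lam - mu * (x @ x - 1.0)         # multiplier update
    return x, lam

x, lam = auglag_solve()
# x -> (-1/sqrt(2), -1/sqrt(2)) and lam -> lambda* = -1/sqrt(2),
# even though mu stays at 10 throughout.
```

Unlike the pure penalty run, the iterates here converge to the exact constrained solution without $\mu$ ever increasing.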

The Lagrange multipliers $\lambda_k$ allow the method to weight different constraints differently, directly addressing the hanging chain problem above.

Convergence of augmented Lagrangian

Recall the KKT conditions for the equality-constrained problem $\min f(\mathbf{x})$ s.t. $\mathbf{c}(\mathbf{x}) = 0$:

KKT conditions
  1. $\mathbf{c}(\mathbf{x}^*) = 0$  (feasibility)
  2. $\nabla f(\mathbf{x}^*) - \mathbf{J}_c(\mathbf{x}^*)^T \lambda^* = 0$  (stationarity — gradient of Lagrangian vanishes)
Theorem 17.5 (Paraphrased)

Suppose $(\mathbf{x}^*, \lambda^*)$ is a KKT point satisfying second-order sufficient conditions, and the constraint Jacobian $\mathbf{J}_c(\mathbf{x}^*)$ has full rank. Then for $\mu$ sufficiently large, $\mathbf{x}^*$ is a strict local minimizer of $\mathcal{L}(\mathbf{x}; \lambda^*, \mu)$.

This means if we know the true multipliers, we can solve the augmented Lagrangian subproblem for a finite $\mu$.

Theorem 17.6 (Paraphrased)

Under the same conditions, the augmented Lagrangian algorithm converges: the multiplier estimates $\lambda_k \to \lambda^*$, and the iterates $\mathbf{x}_k \to \mathbf{x}^*$, for $\mu$ bounded away from zero.

LANCELOT (Conn, Gould & Toint) is a well-known implementation of augmented Lagrangian methods, specifically designed for problems with bound constraints $\ell \le \mathbf{x} \le \mathbf{u}$. It uses Algorithm 17.4 from Nocedal & Wright, solving each subproblem with a trust-region method that respects bounds.

Barrier methods

For inequality constraints $\mathbf{d}(\mathbf{x}) \ge 0$, barrier (or interior-point) methods add a logarithmic penalty that prevents iterates from leaving the feasible region:

$$\min_{\mathbf{x}}\; f(\mathbf{x}) - \mu \sum_i \log\big(d_i(\mathbf{x})\big)$$

As $\mu \to 0$, the barrier term weakens and the solution approaches the constrained optimum. Unlike penalty methods, iterates stay feasible throughout.

Key idea: The log barrier goes to $+\infty$ as any $d_i(\mathbf{x}) \to 0^+$, creating an invisible wall at the constraint boundary. This naturally converts inequality constraints into an unconstrained problem — but only works when the feasible set has a non-empty interior.
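A one-dimensional sketch of the barrier idea on a toy problem of my own choosing, $\min\; x^2$ s.t. $x - 1 \ge 0$, using safeguarded Newton steps that stay strictly inside the feasible region:

```python
def barrier_solve(mus=(1.0, 0.1, 0.01, 1e-4), x0=2.0, iters=50):
    """Log-barrier method for min x^2 s.t. x - 1 >= 0.

    Minimizes phi(x) = x^2 - mu * log(x - 1) for decreasing mu,
    warm-starting each solve. Iterates remain strictly feasible."""
    x = x0
    for mu in mus:
        for _ in range(iters):
            g = 2.0 * x - mu / (x - 1.0)       # phi'(x)
            h = 2.0 + mu / (x - 1.0) ** 2      # phi''(x)
            step = -g / h
            while x + step <= 1.0:             # stay inside the barrier
                step *= 0.5
            x += step
    return x

x = barrier_solve()
# The subproblem has the closed form x(mu) = (1 + sqrt(1 + 2*mu)) / 2,
# so x -> 1 from above as mu -> 0.
```

The step-halving loop is the crucial interior-point safeguard: a raw Newton step could jump across the boundary, where the log term is undefined.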

A common practical approach combines barrier methods with equality constraint handling:

  1. Convert inequalities $\mathbf{d}(\mathbf{x}) \ge 0$ into equalities plus bounds: introduce slack variables $\mathbf{s} \ge 0$ with $\mathbf{d}(\mathbf{x}) - \mathbf{s} = 0$
  2. Apply a log barrier to the bounds: $-\mu \sum_i \log(s_i)$
  3. Handle the remaining equalities with augmented Lagrangian or Newton's method

Reference: Chapter 16, Griva, Sofer & Nash.

Other approaches

  • SQP — Sequential Quadratic Programming: at each step, solve a QP that approximates the NLP locally. Combines a quadratic model of the Lagrangian with linearized constraints. SNOPT is a well-known implementation.
  • Gradient projection — Project the gradient step onto the feasible set. Natural and efficient for bound constraints ($\ell \le \mathbf{x} \le \mathbf{u}$), where projection is just clamping.
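For the bound-constrained case, projection really is just clamping. A minimal sketch (the function and the toy objective are illustrative):

```python
import numpy as np

def projected_gradient(grad, x0, lo, hi, step=0.1, iters=100):
    """Projected gradient descent for min f(x) s.t. lo <= x <= hi.
    For box constraints the projection is elementwise clamping."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = np.clip(x - step * grad(x), lo, hi)
    return x

# Toy problem: min ||x - (3, -3)||^2 subject to 0 <= x <= 1.
grad = lambda x: 2.0 * (x - np.array([3.0, -3.0]))
x = projected_gradient(grad, x0=[0.5, 0.5], lo=0.0, hi=1.0)
# The unconstrained minimizer (3, -3) is clamped to the box: x = (1, 0).
```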