Matrix Calculus via Differentials

A cleaner approach to matrix derivatives, based on Thomas Minka's Old and New Matrix Algebra Useful for Statistics (2000).

The Problem with Partial Derivatives

In Part 2a, we computed matrix derivatives by grinding out partial derivatives $\frac{\partial f}{\partial X_{ij}}$ one entry at a time. This works, but it's painful and error-prone, especially for matrix-valued expressions.

Minka's key insight: instead of computing partial derivatives directly, work with differentials and then identify the derivative at the end.

The differential approach:
1. Take the differential $df$ of both sides
2. Apply simple rules to manipulate $df$ into a standard form
3. Read off the derivative by pattern matching

This is analogous to how we work with derivatives of scalar functions: instead of going back to limits every time, we use rules like $(fg)' = f'g + fg'$ mechanically. The differential approach gives us equally clean rules for matrices.

What is a Derivative?

The derivative $\frac{dy}{dx}$ depends on what types $y$ and $x$ are. Here's the full picture:

| | $x$ scalar | $\mathbf{x}$ vector | $\mathbf{X}$ matrix |
|---|---|---|---|
| $y$ scalar | $\frac{dy}{dx}$ (scalar) | $\frac{dy}{d\mathbf{x}} = \begin{bmatrix}\frac{\partial y}{\partial x_j}\end{bmatrix}$ (row vector) | $\frac{dy}{d\mathbf{X}} = \begin{bmatrix}\frac{\partial y}{\partial x_{ji}}\end{bmatrix}$ (matrix) |
| $\mathbf{y}$ vector | $\frac{d\mathbf{y}}{dx} = \begin{bmatrix}\frac{\partial y_i}{\partial x}\end{bmatrix}$ (column vector) | $\frac{d\mathbf{y}}{d\mathbf{x}} = \begin{bmatrix}\frac{\partial y_i}{\partial x_j}\end{bmatrix}$ (Jacobian matrix) | |
| $\mathbf{Y}$ matrix | $\frac{d\mathbf{Y}}{dx} = \begin{bmatrix}\frac{\partial y_{ij}}{\partial x}\end{bmatrix}$ (matrix) | | |

Notice the transpose! For a scalar $y$ and matrix $\mathbf{X}$, the $(i,j)$ entry of $\frac{dy}{d\mathbf{X}}$ is $\frac{\partial y}{\partial x_{ji}}$ -- not $\frac{\partial y}{\partial x_{ij}}$. This convention ensures that the chain rule works cleanly without extra transposes. It also means $\frac{dy}{d\mathbf{x}}$ is a row vector (the transpose of what we called the gradient in Part 2a).
Convention note: In Minka's notation, $\frac{dy}{d\mathbf{x}}$ is a row vector. Our gradient $\nabla f$ from Part 2a is the transpose: $\nabla f(x) = \left(\frac{df}{d\mathbf{x}}\right)^T$. This page follows Minka's convention; just remember to transpose at the end if you need a column gradient.

The Differential Rules

The power of the differential approach is that the rules are simple, mechanical, and look almost identical to scalar calculus. Here are all the rules you need:

Basic Rules
$d\mathbf{A} = 0$  —  constant matrix
$d(\alpha \mathbf{X}) = \alpha \, d\mathbf{X}$  —  scalar multiple
$d(\mathbf{X} + \mathbf{Y}) = d\mathbf{X} + d\mathbf{Y}$  —  sum
$d(\mathrm{tr}(\mathbf{X})) = \mathrm{tr}(d\mathbf{X})$  —  trace
Product Rules
$d(\mathbf{X}\mathbf{Y}) = (d\mathbf{X})\mathbf{Y} + \mathbf{X}(d\mathbf{Y})$  —  matrix product
$d(\mathbf{X} \otimes \mathbf{Y}) = (d\mathbf{X}) \otimes \mathbf{Y} + \mathbf{X} \otimes (d\mathbf{Y})$  —  Kronecker product
$d(\mathbf{X} \circ \mathbf{Y}) = (d\mathbf{X}) \circ \mathbf{Y} + \mathbf{X} \circ (d\mathbf{Y})$  —  Hadamard (elementwise) product
Inverse, Determinant, and Log-Determinant
$d\mathbf{X}^{-1} = -\mathbf{X}^{-1}(d\mathbf{X})\mathbf{X}^{-1}$  —  inverse
$d|\mathbf{X}| = |\mathbf{X}|\,\mathrm{tr}(\mathbf{X}^{-1} d\mathbf{X})$  —  determinant
$d\log|\mathbf{X}| = \mathrm{tr}(\mathbf{X}^{-1} d\mathbf{X})$  —  log-determinant
Other Useful Rules
$d\mathbf{X}^T = (d\mathbf{X})^T$  —  transpose
$d\mathbf{X}^* = (d\mathbf{X})^*$  —  conjugate transpose

These are essentially the same rules as single-variable calculus -- product rule, chain rule -- but now applied to matrices. The crucial thing is that order matters (matrix multiplication doesn't commute), so you can't rearrange factors freely.
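Each rule in the table can be verified numerically by comparing the differential against the actual first-order change of the expression under a small perturbation. A minimal NumPy sketch checking the product and inverse rules (seed, sizes, and tolerances are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted diagonal keeps X well-conditioned
Y = rng.standard_normal((n, n))
dX = 1e-6 * rng.standard_normal((n, n))           # a small perturbation of X

# Product rule: d(XY) = (dX) Y + X (dY), here with Y held fixed (dY = 0)
lhs_prod = (X + dX) @ Y - X @ Y
rhs_prod = dX @ Y
print(np.allclose(lhs_prod, rhs_prod, atol=1e-10))

# Inverse rule: d(X^{-1}) = -X^{-1} (dX) X^{-1}
Xinv = np.linalg.inv(X)
lhs_inv = np.linalg.inv(X + dX) - Xinv
rhs_inv = -Xinv @ dX @ Xinv
print(np.allclose(lhs_inv, rhs_inv, atol=1e-10))
```

The agreement is only up to $O(\|d\mathbf{X}\|^2)$, which is why the perturbation must be small relative to the tolerance.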

The Identification Step

After applying the differential rules, we'll have $df$ expressed in terms of $d\mathbf{X}$. To read off the derivative, we use pattern matching.

Scalar function of a matrix: the key identity
If you can write: $$df = \mathrm{tr}(\mathbf{A}^T \, d\mathbf{X})$$ then the derivative is: $$\frac{df}{d\mathbf{X}} = \mathbf{A}$$

Why does this work? Because $\mathrm{tr}(\mathbf{A}^T d\mathbf{X}) = \sum_{i,j} A_{ij} \, dX_{ij}$. Comparing with the first-order expansion $df = \sum_{i,j} \frac{\partial f}{\partial X_{ij}} dX_{ij}$, we identify $A_{ij} = \frac{\partial f}{\partial X_{ij}}$: the matrix $\mathbf{A}$ collects the partial derivatives, laid out like $\mathbf{X}$ itself.

For vectors: If $df = \mathbf{a}^T d\mathbf{x}$, then $\frac{df}{d\mathbf{x}} = \mathbf{a}^T$ (a row vector). The gradient (column vector) is $\nabla f = \mathbf{a}$.
Useful trace identities for rearranging into the right form:
$\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$  —  cyclic property
$\mathrm{tr}(\mathbf{A}^T) = \mathrm{tr}(\mathbf{A})$  —  transpose invariance
$\mathbf{a}^T \mathbf{b} = \mathrm{tr}(\mathbf{a}^T \mathbf{b}) = \mathrm{tr}(\mathbf{b}\mathbf{a}^T)$  —  scalar = its own trace
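These identities are easy to confirm numerically; a quick NumPy sanity check (shapes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))
a = rng.standard_normal(4)
b = rng.standard_normal(4)

# Cyclic property: tr(AB) = tr(BA), valid even for non-square A, B
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))

# Transpose invariance: tr(M^T) = tr(M)
M = rng.standard_normal((4, 4))
print(np.isclose(np.trace(M.T), np.trace(M)))

# A scalar is its own trace: a^T b = tr(b a^T)
print(np.isclose(a @ b, np.trace(np.outer(b, a))))
```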

Warm-up: $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$

Let's start simple to see the method in action.

Step 1: Take the differential
$df = d(\mathbf{a}^T \mathbf{x})$. Since $\mathbf{a}$ is constant, $d\mathbf{a} = 0$, so by the product rule:
$df = \mathbf{a}^T d\mathbf{x}$
Step 2: Identify
This is already in the form $df = \mathbf{a}^T d\mathbf{x}$, so $\frac{df}{d\mathbf{x}} = \mathbf{a}^T$ and $\nabla f = \mathbf{a}$.

That was almost too easy. The differential approach really shines on harder problems.

Example: $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$

Step 1: Take the differential
$df = d(\mathbf{x}^T \mathbf{A} \mathbf{x})$. Apply the product rule to $\mathbf{x}^T$ times $\mathbf{A}\mathbf{x}$:
$df = (d\mathbf{x}^T) \mathbf{A} \mathbf{x} + \mathbf{x}^T \mathbf{A} (d\mathbf{x})$
(here $\mathbf{A}$ is constant so $d\mathbf{A} = 0$, and we used the product rule on the three-factor product)
Step 2: Simplify using transposes
The first term: $(d\mathbf{x}^T)\mathbf{A}\mathbf{x} = (d\mathbf{x})^T \mathbf{A}\mathbf{x}$. This is a scalar, so it equals its transpose:
$(d\mathbf{x})^T \mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x})^T d\mathbf{x} = \mathbf{x}^T \mathbf{A}^T d\mathbf{x}$
So: $df = \mathbf{x}^T \mathbf{A}^T d\mathbf{x} + \mathbf{x}^T \mathbf{A} \, d\mathbf{x} = \mathbf{x}^T(\mathbf{A}^T + \mathbf{A}) d\mathbf{x}$
Step 3: Identify
$df = \mathbf{x}^T(\mathbf{A}+\mathbf{A}^T) d\mathbf{x}$ matches $df = \mathbf{a}^T d\mathbf{x}$ with $\mathbf{a} = (\mathbf{A}+\mathbf{A}^T)\mathbf{x}$.
So $\nabla f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$. For symmetric $\mathbf{A}$: $\nabla f = 2\mathbf{A}\mathbf{x}$.
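This gradient is easy to validate against central finite differences; a minimal NumPy sketch (seed and size arbitrary, with $\mathbf{A}$ deliberately non-symmetric so the $\mathbf{A} + \mathbf{A}^T$ form is exercised):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))      # non-symmetric on purpose
x = rng.standard_normal(n)

f = lambda x: x @ A @ x              # f(x) = x^T A x
grad_analytic = (A + A.T) @ x        # the gradient derived above

# central differences, one coordinate direction at a time
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))
```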

The Main Event: $f(\mathbf{X}) = \log|\mathbf{X}|$

This is where the differential approach really shines compared to the element-by-element approach we used in Part 2a. In Part 2a, we needed the identity $\det(\mathbf{X} + \epsilon \mathbf{E}_{ij}) = \det(\mathbf{X})(1 + \epsilon [\mathbf{X}^{-1}]_{ji} + \cdots)$ and several steps of reasoning. Watch how clean the differential version is.

We'll use just two ingredients -- the determinant rule from our table and the ordinary scalar chain rule -- and build the result from scratch.

Step 1: Decompose with the scalar chain rule
Write $f = \log(g)$ where $g = |\mathbf{X}|$ (the determinant). The scalar chain rule for differentials says $d(\log(g)) = \frac{1}{g}\,dg$, so:
$$df = \frac{1}{|\mathbf{X}|} \, d|\mathbf{X}|$$
Step 2: Apply the determinant rule
From our table: $d|\mathbf{X}| = |\mathbf{X}| \, \mathrm{tr}(\mathbf{X}^{-1} d\mathbf{X})$. Substituting: $$df = \frac{1}{|\mathbf{X}|} \cdot |\mathbf{X}| \, \mathrm{tr}(\mathbf{X}^{-1} d\mathbf{X})$$
Step 3: Simplify -- a beautiful cancellation
The $|\mathbf{X}|$ factors cancel perfectly: $$df = \mathrm{tr}(\mathbf{X}^{-1} d\mathbf{X})$$ This is why $\log\det$ is so much nicer to work with than $\det$ alone -- the log absorbs the scaling factor.
Step 4: Identify the derivative
We need the form $df = \mathrm{tr}(\mathbf{A}^T d\mathbf{X})$. We have $df = \mathrm{tr}(\mathbf{X}^{-1} d\mathbf{X})$.
So $\mathbf{A}^T = \mathbf{X}^{-1}$, which means $\mathbf{A} = \mathbf{X}^{-T}$.
Result
$$\frac{d\log|\mathbf{X}|}{d\mathbf{X}} = \mathbf{X}^{-T}$$ Four short steps, each completely mechanical. Compare to the derivation in Part 2a!
For symmetric $\mathbf{X}$: $\mathbf{X}^{-T} = \mathbf{X}^{-1}$, so the derivative is simply $\mathbf{X}^{-1}$.
Notice the pattern. We didn't need a special rule for $\log|\mathbf{X}|$ -- we built it from the determinant rule and the scalar chain rule. The differential approach lets you compose rules freely, just like in single-variable calculus. The $\log|\mathbf{X}|$ entry in the table is a convenience, not a necessity.
FD Check: $\log|\mathbf{X}|$
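A concrete version of this check: compare the entries of $\mathbf{X}^{-T}$ against central finite differences of $\log|\mathbf{X}|$. A minimal NumPy sketch (seed, size, and the diagonal shift that keeps $|\mathbf{X}| > 0$ are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
# A generic (non-symmetric) X; the diagonal shift keeps its determinant positive
X = rng.standard_normal((n, n)) + 2 * n * np.eye(n)
sign, _ = np.linalg.slogdet(X)
assert sign > 0                      # log|X| requires a positive determinant

f = lambda X: np.log(np.linalg.det(X))
grad_analytic = np.linalg.inv(X).T   # the result derived above: X^{-T}

# central differences: entry [i, j] approximates the sensitivity of f to X[i, j]
eps = 1e-6
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = 1.0
        grad_fd[i, j] = (f(X + eps * E) - f(X - eps * E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))
```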

Example: $f(\mathbf{X}) = \mathrm{tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B})$

This pattern arises in Gaussian likelihoods. Let's see the differential approach handle it.

Step 1: Take the differential
$df = d\,\mathrm{tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}) = \mathrm{tr}(d(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}))$
Since $\mathbf{A}$ and $\mathbf{B}$ are constant: $= \mathrm{tr}(\mathbf{A} \, d(\mathbf{X}^{-1}) \, \mathbf{B})$
Step 2: Apply the inverse rule
$d\mathbf{X}^{-1} = -\mathbf{X}^{-1}(d\mathbf{X})\mathbf{X}^{-1}$
So: $df = \mathrm{tr}(\mathbf{A}(-\mathbf{X}^{-1} d\mathbf{X} \, \mathbf{X}^{-1})\mathbf{B}) = -\mathrm{tr}(\mathbf{A}\mathbf{X}^{-1} d\mathbf{X} \, \mathbf{X}^{-1}\mathbf{B})$
Step 3: Cyclic property of trace
Use $\mathrm{tr}(\mathbf{P}\mathbf{Q}\mathbf{R}\mathbf{S}) = \mathrm{tr}(\mathbf{S}\mathbf{P}\mathbf{Q}\mathbf{R})$ to cycle $d\mathbf{X}$ to the end:
$df = -\mathrm{tr}(\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1} \, d\mathbf{X})$
Step 4: Identify
Match $df = \mathrm{tr}(\mathbf{C}^T d\mathbf{X})$ with $\mathbf{C}^T = -\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1}$ (writing $\mathbf{C}$ for the unknown, since $\mathbf{A}$ is already taken by the constant matrix).
So $\frac{df}{d\mathbf{X}} = \mathbf{C} = -(\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1})^T = -\mathbf{X}^{-T}\mathbf{A}^T\mathbf{B}^T\mathbf{X}^{-T}$
FD Check: $\mathrm{tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B})$
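As before, the result can be checked entry by entry against central finite differences; a minimal NumPy sketch (seed, sizes, and the diagonal shift keeping $\mathbf{X}$ invertible are our choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well-conditioned

f = lambda X: np.trace(A @ np.linalg.inv(X) @ B)
Xinv = np.linalg.inv(X)
grad_analytic = -Xinv.T @ A.T @ B.T @ Xinv.T      # -X^{-T} A^T B^T X^{-T}

# central differences over every entry of X
eps = 1e-6
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = 1.0
        grad_fd[i, j] = (f(X + eps * E) - f(X - eps * E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))
```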

Example: $f(\mathbf{x}) = (\mathbf{A}\mathbf{x} - \mathbf{b})^T \mathbf{W} (\mathbf{A}\mathbf{x} - \mathbf{b})$

Weighted least squares -- find the gradient with respect to $\mathbf{x}$, where $\mathbf{W}$ is symmetric positive definite.

Step 1: Let $\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}$
$f = \mathbf{r}^T \mathbf{W} \mathbf{r}$, and $d\mathbf{r} = \mathbf{A}\,d\mathbf{x}$ (since $\mathbf{b}$ is constant).
Step 2: Differentiate the quadratic form
$df = (d\mathbf{r})^T \mathbf{W}\mathbf{r} + \mathbf{r}^T \mathbf{W} (d\mathbf{r})$
Both terms are scalars. The first term transposed: $(d\mathbf{r})^T\mathbf{W}\mathbf{r} = \mathbf{r}^T\mathbf{W}^T d\mathbf{r} = \mathbf{r}^T\mathbf{W}\,d\mathbf{r}$ (since $\mathbf{W} = \mathbf{W}^T$).
So: $df = 2\mathbf{r}^T \mathbf{W} \, d\mathbf{r}$
Step 3: Substitute $d\mathbf{r} = \mathbf{A}\,d\mathbf{x}$
$df = 2\mathbf{r}^T \mathbf{W} \mathbf{A} \, d\mathbf{x} = 2(\mathbf{A}\mathbf{x} - \mathbf{b})^T \mathbf{W} \mathbf{A} \, d\mathbf{x}$
Step 4: Identify
$df = \mathbf{c}^T d\mathbf{x}$ where $\mathbf{c} = 2\mathbf{A}^T\mathbf{W}(\mathbf{A}\mathbf{x} - \mathbf{b})$.
$$\nabla f = 2\mathbf{A}^T\mathbf{W}(\mathbf{A}\mathbf{x} - \mathbf{b})$$ Setting to zero gives the normal equations: $\mathbf{A}^T\mathbf{W}\mathbf{A}\mathbf{x} = \mathbf{A}^T\mathbf{W}\mathbf{b}$.
FD Check: Weighted Least Squares
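The gradient and the FD check for this example can be sketched as follows in NumPy (seed and dimensions arbitrary; $\mathbf{W}$ is constructed to be symmetric positive definite as the problem requires):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 6, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
M = rng.standard_normal((m, m))
W = M @ M.T + np.eye(m)              # symmetric positive definite by construction
x = rng.standard_normal(n)

f = lambda x: (A @ x - b) @ W @ (A @ x - b)   # f = r^T W r with r = Ax - b
grad_analytic = 2 * A.T @ W @ (A @ x - b)     # the gradient derived above

# central differences along each coordinate
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))
```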

The Differential Method: Summary

Recipe for any matrix derivative:

1. Take the differential $df$ using the rules: product rule, chain rule, plus the table entries for inverse, determinant, trace, etc.

2. Rearrange into standard form using trace identities (cyclic property, transpose inside trace).

3. Identify the derivative:
• Scalar $f$, vector $\mathbf{x}$: match $df = \mathbf{a}^T d\mathbf{x}$, then $\nabla f = \mathbf{a}$
• Scalar $f$, matrix $\mathbf{X}$: match $df = \mathrm{tr}(\mathbf{A}^T d\mathbf{X})$, then $\frac{df}{d\mathbf{X}} = \mathbf{A}$

4. Check with finite differences!
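Step 4 of the recipe can be packaged once and reused for every example on this page; a sketch of such a helper (the name `fd_grad` is ours):

```python
import numpy as np

def fd_grad(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f at X (vector or matrix),
    laid out like X: entry [i, j] approximates the sensitivity of f to X[i, j]."""
    G = np.zeros_like(X, dtype=float)
    it = np.nditer(X, flags=["multi_index"])
    for _ in it:
        E = np.zeros_like(X, dtype=float)
        E[it.multi_index] = 1.0
        G[it.multi_index] = (f(X + eps * E) - f(X - eps * E)) / (2 * eps)
    return G

# e.g. re-check d log|X| / dX = X^{-T} on a small matrix with positive determinant
X = np.array([[3.0, 1.0], [0.5, 2.0]])
G = fd_grad(lambda X: np.log(np.linalg.det(X)), X)
print(np.allclose(G, np.linalg.inv(X).T, atol=1e-6))
```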

← Part 2a: Direct Rules Next: Automatic Differentiation →