# Implicit Differentiation for Hyperparameter Optimization

## 2021/02/06

Implicit differentiation is often used for hyperparameter optimization, to avoid backpropagating through an optimization process. A recent example can be found in this paper, which applies the technique to learn, among other things, data augmentations that improve generalization. Define the following hyperparameter optimization problem, where $\lambda$ denotes hyperparameters and $\theta^{*}$ denotes (optimized) parameters:

 ${\mathrm{\Xi »}}^{\beta }=\mathrm{arg}\beta ‘\underset{\mathrm{\Xi »}}{\mathrm{min}\beta ‘}g\left({\mathrm{\pi }}^{\beta }\left(\mathrm{\Xi »}\right),\mathrm{\Xi »}\right)$ (1)

The optimized parameters $\theta^{*}$ are the output of another optimization, the training process, and this training problem depends on the hyperparameters:

 ${\mathrm{\pi }}^{\beta }\left(\mathrm{\Xi »}\right)=\mathrm{arg}\beta ‘\underset{\mathrm{\pi }}{\mathrm{min}\beta ‘}f\left(\mathrm{\pi },\mathrm{\Xi »}\right)$ (2)

Notice we have separate losses: for example, $g$ is the validation loss and $f$ is the training loss. Now, to solve the hyperparameter optimization, assume we are doing something like gradient descent (GD) on $\lambda$ with current iterate $\lambda_{t}$, so with learning rate $\eta$ the update is:

 $\lambda_{t+1} = \lambda_{t} - \eta\, \frac{dg}{d\lambda}\left(\theta^{*}(\lambda_{t}), \lambda_{t}\right)$

Expanding the derivative via chain rule (and dropping arguments to reduce clutter):

 $\frac{dg}{d\lambda} = \frac{\partial g}{\partial \theta} \frac{d\theta^{*}}{d\lambda} + \frac{\partial g}{\partial \lambda}$

The hard part above is computing the term $\frac{d\theta^{*}}{d\lambda}(\lambda_{t})$, which requires differentiating through the training optimization (Eq. 2) that produced $\theta^{*}$. If the training optimization is also gradient descent, then we could consider the entire GD procedure as one giant forward pass and backpropagate through the GD updates. However, if producing $\theta^{*}$ took thousands of GD steps, this backpropagation can be both slow and memory intensive. Instead, the implicit function theorem (IFT) gives us an alternative to backpropagation: since we know $\theta^{*}(\lambda_{t})$ is an optimum of the training problem, we have:

 $\frac{\mathit{\beta f}}{\mathit{\beta \pi }}\left({\mathrm{\pi }}^{\beta },{\mathrm{\Xi »}}_{t}\right)=0$

Now applying $\frac{d\beta  }{\mathit{d\Xi »}}$ to the front, expanding by chain rule, and re-arranging, we have:

$\begin{aligned} 0 &= \frac{d}{d\lambda} \frac{\partial f}{\partial \theta} \\ 0 &= \frac{\partial^{2} f}{\partial \theta^{2}} \frac{d\theta^{*}}{d\lambda} + \frac{\partial^{2} f}{\partial \lambda\, \partial \theta} \\ \frac{d\theta^{*}}{d\lambda} &= -\left[\frac{\partial^{2} f}{\partial \theta^{2}}\right]^{-1} \frac{\partial^{2} f}{\partial \lambda\, \partial \theta} \end{aligned}$
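As a concrete check, here is a toy scalar example (my own construction, not from the paper): take training loss $f(\theta, \lambda) = \frac{1}{2}(\theta - a)^{2} + \frac{1}{2}\lambda\theta^{2}$, which has the closed-form optimum $\theta^{*}(\lambda) = a/(1+\lambda)$ and hence $\frac{d\theta^{*}}{d\lambda} = -a/(1+\lambda)^{2}$. The sketch below computes this sensitivity both ways: by carrying $\frac{d\theta}{d\lambda}$ through the unrolled GD updates (forward-mode, since $\lambda$ is a scalar here), and by the IFT formula.

```python
# Toy example (hypothetical, not from the paper):
#   f(theta, lam) = 0.5*(theta - a)**2 + 0.5*lam*theta**2
#   => df/dtheta = (1 + lam)*theta - a,  theta*(lam) = a/(1 + lam)
#   => d theta*/d lam = -a/(1 + lam)**2

def unrolled_sensitivity(lam, a=2.0, lr=0.1, steps=500):
    """Differentiate through unrolled GD: carry d(theta)/d(lam) per update."""
    theta, dtheta = 0.0, 0.0
    for _ in range(steps):
        # The update is theta' = theta - lr*((1 + lam)*theta - a);
        # differentiate it w.r.t. lam (product + chain rule):
        dtheta = dtheta * (1.0 - lr * (1.0 + lam)) - lr * theta
        theta = theta - lr * ((1.0 + lam) * theta - a)
    return theta, dtheta

def ift_sensitivity(theta_star, lam):
    """IFT: d theta*/d lam = -[d2f/dtheta2]^{-1} * d2f/(dlam dtheta)."""
    d2f_dtheta2 = 1.0 + lam           # Hessian of f in theta
    d2f_dlam_dtheta = theta_star      # mixed second derivative
    return -d2f_dlam_dtheta / d2f_dtheta2

theta, d_unrolled = unrolled_sensitivity(lam=1.0)
d_ift = ift_sensitivity(theta, lam=1.0)
print(theta, d_unrolled, d_ift)  # theta ~ 1.0, both sensitivities ~ -0.5
```

Both routes agree with the closed form $-a/(1+\lambda)^{2} = -0.5$ at $a = 2$, $\lambda = 1$, but the IFT route never needed the intermediate GD iterates.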

We have assumed that the above Hessian $\frac{\partial^{2} f}{\partial \theta^{2}}$ is invertible, and even if that's true, in practice it can be hard to compute (and invert) exactly, so people often use approximations instead.
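One common approximation in the literature replaces the exact inverse with a truncated Neumann series: for a step size $\alpha$ with $\|I - \alpha H\| < 1$, we have $H^{-1} v = \alpha \sum_{k \ge 0} (I - \alpha H)^{k} v$. A minimal sketch, assuming we only have access to Hessian-vector products (the `hessian_vp` callback is my placeholder, and the example below uses a scalar "Hessian" for illustration):

```python
def neumann_inverse_hvp(hessian_vp, v, alpha=0.1, iters=200):
    """Approximate H^{-1} v by a truncated Neumann series:
    alpha * sum_k (I - alpha*H)^k v, valid when ||I - alpha*H|| < 1."""
    p = v      # running sum of the series (k = 0 term)
    cur = v    # current term (I - alpha*H)^k v
    for _ in range(iters):
        cur = cur - alpha * hessian_vp(cur)  # multiply by (I - alpha*H)
        p = p + cur
    return alpha * p

# Scalar check with H = 2.0: H^{-1} v for v = 1.0 should be 0.5.
approx = neumann_inverse_hvp(lambda u: 2.0 * u, 1.0)
print(approx)  # ~0.5
```

The appeal is that each iteration only needs a Hessian-vector product, which autodiff frameworks can compute without materializing the full Hessian.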

Actually, the above is the second part of the IFT, so we have skipped a step. The first part of the implicit function theorem says that if there exists a point $(\theta^{*}, \lambda_{t})$ and a smooth function $h$ satisfying $h(\theta^{*}, \lambda_{t}) = 0$ such that $\frac{\partial h}{\partial \theta}(\theta^{*}, \lambda_{t})$ is invertible, then there exists some differentiable function $y(\lambda)$ from a neighborhood of $\lambda_{t}$ to a neighborhood of $\theta^{*}$ satisfying $\theta^{*} = y(\lambda_{t})$ and $h(y(\lambda), \lambda) = 0$. I.e., around the point $(\theta^{*}, \lambda_{t})$ we can write $\theta$ as a function $y(\lambda)$, and we can differentiate $y(\lambda)$ implicitly as demonstrated above. In the calculations above we have simply used $h = \frac{\partial f}{\partial \theta}$ and overloaded notation to write $\theta^{*}(\lambda)$ instead of $\theta^{*} = y(\lambda)$.
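Putting the pieces together, the full hyperparameter optimization loop can be sketched on a toy problem (again my own construction, not from the paper): take training loss $f(\theta, \lambda) = \frac{1}{2}(\theta - a)^{2} + \frac{1}{2}\lambda\theta^{2}$, so $\theta^{*}(\lambda) = a/(1+\lambda)$, and validation loss $g(\theta) = \frac{1}{2}(\theta - b)^{2}$. The outer GD update on $\lambda$ uses the IFT to supply $\frac{d\theta^{*}}{d\lambda}$; the best hyperparameter satisfies $a/(1+\lambda) = b$, i.e. $\lambda = a/b - 1$.

```python
# Outer GD on lam with the IFT hypergradient (toy example, not from the paper):
#   f(theta, lam) = 0.5*(theta - a)**2 + 0.5*lam*theta**2, theta* = a/(1 + lam)
#   g(theta)      = 0.5*(theta - b)**2
# Best lam satisfies a/(1 + lam) = b, i.e. lam = a/b - 1.

def hyperopt(a=2.0, b=1.0, lam=0.0, eta=0.1, steps=2000):
    for _ in range(steps):
        theta_star = a / (1.0 + lam)              # inner argmin, closed form here
        dtheta_dlam = -theta_star / (1.0 + lam)   # IFT: -[f_tt]^{-1} f_lt
        dg_dlam = (theta_star - b) * dtheta_dlam  # chain rule; g has no direct lam term
        lam = lam - eta * dg_dlam                 # outer GD update
    return lam

lam = hyperopt()
print(lam)  # ~1.0, since a/b - 1 = 1
```

In a real setting the inner argmin has no closed form, so $\theta^{*}$ comes from a training run and the inverse Hessian is approximated; the structure of the loop stays the same.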