# Kernels: Representer theorem

## 1 Setup

Consider a supervised learning problem with $$m$$ training data points $$\{ (x^i, y^i) \}_{i = 1}^m$$. Assume we have a feature map $$\phi$$ of our inputs, where for any input $$x$$ the feature map produces $$\phi (x) \in \mathbb{R}^n$$. Then we can arrange our featurized training data in an $$n \times m$$ matrix and the target data $$y^i$$ into a vector:

$$\displaystyle \Phi = \left( \begin{array}{ccc} | & & |\\ \phi (x^1) & \ldots & \phi (x^m)\\ | & & | \end{array} \right), \quad \boldsymbol{y} = \left( \begin{array}{c} y^1\\ \vdots\\ y^m \end{array} \right)$$

As an example, suppose we are doing regularized linear regression, so our model's output is $$\hat{y} (x) = \phi (x)^T w$$ for some learned parameters $$w \in \mathbb{R}^n$$. The objective is to find $$w$$ that minimizes:

\begin{eqnarray*} J^{\operatorname{LS}} (w) & = & \frac{1}{m} \sum_{i = 1}^m \left( \phi (x^i)^T w - y^i \right)^2 + \lambda \| w \|^2\\ & = & \frac{1}{m} \| \Phi^T w - \boldsymbol{y} \|^2 + \lambda \| w \|^2 \end{eqnarray*}

If we define the loss function $$L (\boldsymbol{\hat{y}}, \boldsymbol{y}) = \frac{1}{m} \| \boldsymbol{\hat{y}} - \boldsymbol{y} \|^2$$ (mean squared error), then we could rewrite the objective more generically:

 $$\displaystyle J (w) = L (\Phi^T w, \boldsymbol{y}) + \lambda \| w \|^2$$ (1)

By choosing different loss functions $$L$$, we can end up with different methods besides linear regression; for example, SVM training would use a hinge loss. We will use this more general objective $$J (w)$$ from now on, without being too specific about what the loss function $$L$$ is.
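To make the objective concrete, here is a minimal numpy sketch of Eq. 1 with the mean-squared-error loss. The sizes $$n$$, $$m$$, the value of $$\lambda$$, and the random stand-in for the featurized data are arbitrary choices for illustration only:

```python
import numpy as np

# Minimal sketch of Eq. 1 with the mean-squared-error loss.
# Phi has shape (n, m): column i is phi(x^i); y has shape (m,).
def J(w, Phi, y, lam):
    residual = Phi.T @ w - y          # predictions minus targets, shape (m,)
    mse = np.mean(residual ** 2)      # L(Phi^T w, y) = (1/m) ||Phi^T w - y||^2
    return mse + lam * np.dot(w, w)   # add the ridge penalty lambda ||w||^2

# Toy usage with random featurized data (hypothetical sizes n = 5, m = 20).
rng = np.random.default_rng(0)
n, m, lam = 5, 20, 0.1
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(m)
w = rng.standard_normal(n)
print(J(w, Phi, y, lam))
```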

## 2 Representer theorem

The representer theorem tells us that there is a minimizer of $$J$$ that lies in the span of the (featurized) training data. This is the same as saying there is a minimizer in $$\operatorname{im} (\Phi)$$, the column space of $$\Phi$$. More precisely: given any $$w^{\star} \in \mathbb{R}^n$$ that minimizes $$J (w)$$, there is a $$\tilde{w} \in \operatorname{im} (\Phi)$$ such that $$J (\tilde{w}) \leqslant J (w^{\star})$$. So $$\tilde{w}$$ is also a minimizer of $$J$$, and $$\tilde{w}$$ is in the span of the featurized training data.

To show this, recall from linear algebra the direct sum decomposition $$\mathbb{R}^n =\operatorname{im} (\Phi) \oplus \ker (\Phi^T)$$, where the two subspaces are orthogonal complements of each other. Since $$w^{\star} \in \mathbb{R}^n$$, we can decompose it:

$$\displaystyle w^{\star} = \tilde{w} + u, \quad \tilde{w} \in \operatorname{im} (\Phi), u \in \ker (\Phi^T)$$

Plugging this into $$J$$:

$$\displaystyle J (w^{\star}) = J (\tilde{w} + u) = L (\Phi^T (\tilde{w} + u), \boldsymbol{y}) + \lambda \| \tilde{w} + u \|^2$$

Since $$\Phi^T u = 0$$ (because $$u \in \ker (\Phi^T)$$) and $$\tilde{w}^T u = 0$$ (by orthogonality of the two subspaces), this simplifies to:

$$\displaystyle J (w^{\star}) = L (\Phi^T \tilde{w}, \boldsymbol{y}) + \lambda (\| \tilde{w} \|^2 + \| u \|^2) \geqslant L (\Phi^T \tilde{w}, \boldsymbol{y}) + \lambda \| \tilde{w} \|^2 = J (\tilde{w})$$

Since $$w^{\star}$$ is a minimizer, this forces $$J (\tilde{w}) = J (w^{\star})$$, so $$\tilde{w} \in \operatorname{im} (\Phi)$$ is also a minimizer.
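The argument can also be checked numerically. Note that the derivation above never uses the fact that $$w^{\star}$$ is a minimizer, so $$J (\tilde{w}) \leqslant J (w)$$ holds for the projection $$\tilde{w}$$ of any $$w$$. The sketch below uses random data and the MSE loss, with sizes chosen so that $$\operatorname{im} (\Phi)$$ is a proper subspace of $$\mathbb{R}^n$$:

```python
import numpy as np

# Project an arbitrary w onto im(Phi) and check that the objective does not increase.
def J(w, Phi, y, lam):
    return np.mean((Phi.T @ w - y) ** 2) + lam * np.dot(w, w)

rng = np.random.default_rng(1)
n, m, lam = 50, 10, 0.1               # n > m so im(Phi) is a proper subspace of R^n
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(m)
w_star = rng.standard_normal(n)       # stand-in for "some minimizer" w*

P = Phi @ np.linalg.pinv(Phi)         # orthogonal projector onto im(Phi)
w_tilde = P @ w_star                  # component in the span of the featurized data
u = w_star - w_tilde                  # component in ker(Phi^T)

print(np.allclose(Phi.T @ u, 0))      # Phi^T u = 0
print(np.isclose(w_tilde @ u, 0))     # the two components are orthogonal
print(J(w_tilde, Phi, y, lam) <= J(w_star, Phi, y, lam))  # J(w_tilde) <= J(w*)
```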

## 3 The kernel trick

Given the above results, we may choose to minimize $$J (w)$$ exclusively over $$w \in \operatorname{im} (\Phi)$$. This means we can write $$w = \Phi v$$ for some $$v \in \mathbb{R}^m$$, and substitute to have $$J$$ as a function of $$v$$:

 $$\displaystyle J (v) = L (\Phi^T \Phi v, \boldsymbol{y}) + \lambda \| \Phi v \|^2 = L (Kv, \boldsymbol{y}) + \lambda v^T Kv$$ (2)

where $$K = \Phi^T \Phi \in \mathbb{R}^{m \times m}$$.

In some situations our feature dimension $$n$$ is much larger than the number of training data points $$m$$. Assuming $$n \gg m$$, optimizing over $$v \in \mathbb{R}^m$$ (Eq. 2) can be much more computationally convenient than optimizing over $$w \in \mathbb{R}^n$$ (Eq. 1).
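A quick sketch of this equivalence: with $$w = \Phi v$$, Eq. 1 evaluated at $$\Phi v$$ agrees with Eq. 2 evaluated at $$v$$, and the latter touches the data only through the $$m \times m$$ matrix $$K$$. The sizes below are arbitrary, chosen so that $$n \gg m$$, and the MSE loss is used for concreteness:

```python
import numpy as np

# Eq. 1 evaluated at w = Phi v versus Eq. 2 evaluated at v.
def J_primal(w, Phi, y, lam):
    return np.mean((Phi.T @ w - y) ** 2) + lam * np.dot(w, w)

def J_kernel(v, K, y, lam):
    return np.mean((K @ v - y) ** 2) + lam * v @ K @ v

rng = np.random.default_rng(2)
n, m, lam = 500, 20, 0.1            # n >> m: many features, few data points
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(m)
K = Phi.T @ Phi                     # m x m Gram matrix

v = rng.standard_normal(m)
print(np.isclose(J_primal(Phi @ v, Phi, y, lam), J_kernel(v, K, y, lam)))  # True
```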

Additionally, if $$n$$ is large then even forming and storing $$\phi (x) \in \mathbb{R}^n$$ is expensive. Fortunately, if we precompute the matrix $$K = \Phi^T \Phi$$ then we never have to deal with $$\phi$$ directly during training, since Eq. 2 is only in terms of $$K$$. And for certain choices of $$\phi$$, we can compute $$K$$ without ever evaluating $$\phi (x)$$ either, which is the essence of the kernel trick. Notice that:

 $$\displaystyle K_{i j} = \phi (x^i)^T \phi (x^j)$$ (3)

We look for a kernel function $$\boldsymbol{K}$$ such that $$\boldsymbol{K} (x^i, x^j) = \phi (x^i)^T \phi (x^j)$$. If evaluating $$\boldsymbol{K}$$ is cheaper than evaluating $$\phi (x)$$ and then taking dot products, we can just use $$\boldsymbol{K}$$ to form $$K$$ instead.

Suppose for example that each $$x^i \in \mathbb{R}^2$$ and $$\phi$$ produces quadratic features:

$$\displaystyle \phi ((x_1, x_2)^T) = \left( 1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2 \right)^T$$

Now if we define $$\boldsymbol{K} (x, z) = (1 + x^T z)^2$$ for any $$x, z \in \mathbb{R}^2$$, we can check by expanding that

$$\displaystyle \boldsymbol{K} (x, z) = \phi (x)^T \phi (z)$$

Notice that computing $$\boldsymbol{K} (x, z) = (1 + x^T z)^2$$ means taking a dot product between vectors in $$\mathbb{R}^2$$, while computing $$\phi (x)^T \phi (z)$$ naively requires forming and taking the dot product between two vectors in $$\mathbb{R}^6$$.
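This identity is easy to verify numerically. The sketch below implements the quadratic feature map above together with the kernel $$\boldsymbol{K} (x, z) = (1 + x^T z)^2$$ and checks that they agree on random inputs:

```python
import numpy as np

# Quadratic feature map from the example above (inputs in R^2, features in R^6).
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Corresponding kernel function: a dot product in R^2, then a square.
def k(x, z):
    return (1.0 + np.dot(x, z)) ** 2

rng = np.random.default_rng(3)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(k(x, z), phi(x) @ phi(z)))   # True up to floating point
```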

In general, if a feature map $$\phi$$ has a corresponding kernel function $$\boldsymbol{K}$$ satisfying $$\boldsymbol{K} (x, z) = \phi (x)^T \phi (z)$$ then we have the option of forming $$K = \Phi^T \Phi$$ without ever using $$\phi$$, by setting:

$$\displaystyle K_{i j} = \boldsymbol{K} (x^i, x^j)$$

This tends to be useful when $$n$$ (the dimension of $$\phi (x)$$) is extremely large, and there is an equivalent kernel $$\boldsymbol{K}$$ which makes evaluating $$\phi (x)^T \phi (z)$$ feasible.
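As a final illustration, the Gram matrix $$K$$ for a small toy dataset can be formed entry-wise from the kernel function alone, and compared against $$\Phi^T \Phi$$ built from explicit features. The dataset here is random and purely for illustration:

```python
import numpy as np

def phi(x):
    # Quadratic feature map from the example above (inputs in R^2).
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 2))              # m = 10 toy inputs in R^2

K_kernel = (1.0 + X @ X.T) ** 2               # K_ij = (1 + x_i^T x_j)^2, no phi needed
Phi = np.stack([phi(x) for x in X], axis=1)   # n x m matrix of explicit features (n = 6)
K_explicit = Phi.T @ Phi

print(np.allclose(K_kernel, K_explicit))      # True: same Gram matrix either way
```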
