# Kernels: Representer theorem

## 1 Setup

Consider a supervised learning problem with $$m$$ training data points $$\{ (x^i, y^i) \}_{i = 1}^m$$. Assume we have a feature map $$\phi$$ of our inputs, where for any input $$x$$ the feature map produces $$\phi (x) \in \mathbb{R}^n$$. Then we can arrange our featurized training data in an $$n \times m$$ matrix and the target data $$y^i$$ into a vector:

$$\displaystyle \Phi = \left( \begin{array}{ccc} | & & |\\ \phi (x^1) & \ldots & \phi (x^m)\\ | & & | \end{array} \right), \quad \boldsymbol{y} = \left( \begin{array}{c} y^1\\ \vdots\\ y^m \end{array} \right)$$

As an example, suppose we are doing regularized linear regression, so our model's output is $$\hat{y} (x) = \phi (x)^T w$$ for some learned parameters $$w \in \mathbb{R}^n$$. The objective is to find $$w$$ that minimizes:

\begin{eqnarray*} J^{\operatorname{LS}} (w) & = & \frac{1}{m} \sum_{i = 1}^m \left( \phi (x^i)^T w - y^i \right)^2 + \lambda \| w \|^2\\ & = & \frac{1}{m} \| \Phi^T w - \boldsymbol{y} \|^2 + \lambda \| w \|^2 \end{eqnarray*}

If we define the loss function $$L (\boldsymbol{\hat{y}}, \boldsymbol{y}) = \frac{1}{m} \| \boldsymbol{\hat{y}} - \boldsymbol{y} \|^2$$ (mean squared error), then we could rewrite the objective more generically:

 $$\displaystyle J (w) = L (\Phi^T w, \boldsymbol{y}) + \lambda \| w \|^2$$ (1)

By choosing different loss functions $$L$$, we can end up with different methods besides linear regression; for example, SVM training would use a hinge loss. We will use this more general objective $$J (w)$$ from now on, without being too specific about what the loss function $$L$$ is.
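To make the objective concrete, here is a minimal numpy sketch of Eq. 1 with the mean-squared-error loss. The sizes $$n$$, $$m$$, the value of $$\lambda$$, and the random stand-in for the featurized data are arbitrary choices for illustration only:

```python
import numpy as np

# Minimal sketch of Eq. 1 with the mean-squared-error loss.
# Phi has shape (n, m): column i is phi(x^i); y has shape (m,).
def J(w, Phi, y, lam):
    residual = Phi.T @ w - y          # predictions minus targets, shape (m,)
    mse = np.mean(residual ** 2)      # L(Phi^T w, y) = (1/m) ||Phi^T w - y||^2
    return mse + lam * np.dot(w, w)   # add the ridge penalty lambda ||w||^2

# Toy usage with random featurized data (hypothetical sizes n = 5, m = 20).
rng = np.random.default_rng(0)
n, m, lam = 5, 20, 0.1
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(m)
w = rng.standard_normal(n)
print(J(w, Phi, y, lam))
```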

## 2 Representer theorem

The representer theorem tells us that there is a minimizer of $$J$$ that lies in the span of the (featurized) training data. This is the same as saying there is a minimizer in $$\operatorname{im} (\Phi)$$, the column space of $$\Phi$$. More precisely: given any $$w^{\star} \in \mathbb{R}^n$$ that minimizes $$J (w)$$, there is a $$\tilde{w} \in \operatorname{im} (\Phi)$$ such that $$J (\tilde{w}) \leqslant J (w^{\star})$$. So $$\tilde{w}$$ is also a minimizer of $$J$$, and $$\tilde{w}$$ is in the span of the featurized training data.

To show this, recall from linear algebra the direct sum decomposition $$\mathbb{R}^n =\operatorname{im} (\Phi) \oplus \ker (\Phi^T)$$, where the two subspaces are orthogonal complements of each other. Since $$w^{\star} \in \mathbb{R}^n$$, we can decompose it:

$$\displaystyle w^{\star} = \tilde{w} + u, \quad \tilde{w} \in \operatorname{im} (\Phi), u \in \ker (\Phi^T)$$

Plugging this into $$J$$:

$$\displaystyle J (w^{\star}) = J (\tilde{w} + u) = L (\Phi^T (\tilde{w} + u), \boldsymbol{y}) + \lambda \| \tilde{w} + u \|^2$$

Since $$\Phi^T u = 0$$ (because $$u \in \ker (\Phi^T)$$) and $$\tilde{w}^T u = 0$$ (by orthogonality of the two subspaces), this simplifies to:

$$\displaystyle J (w^{\star}) = L (\Phi^T \tilde{w}, \boldsymbol{y}) + \lambda (\| \tilde{w} \|^2 + \| u \|^2) \geqslant L (\Phi^T \tilde{w}, \boldsymbol{y}) + \lambda \| \tilde{w} \|^2 = J (\tilde{w})$$

Since $$w^{\star}$$ is a minimizer, this forces $$J (\tilde{w}) = J (w^{\star})$$, so $$\tilde{w} \in \operatorname{im} (\Phi)$$ is also a minimizer.
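The argument can also be checked numerically. Note that the derivation above never uses the fact that $$w^{\star}$$ is a minimizer, so $$J (\tilde{w}) \leqslant J (w)$$ holds for the projection $$\tilde{w}$$ of any $$w$$. The sketch below uses random data and the MSE loss, with sizes chosen so that $$\operatorname{im} (\Phi)$$ is a proper subspace of $$\mathbb{R}^n$$:

```python
import numpy as np

# Project an arbitrary w onto im(Phi) and check that the objective does not increase.
def J(w, Phi, y, lam):
    return np.mean((Phi.T @ w - y) ** 2) + lam * np.dot(w, w)

rng = np.random.default_rng(1)
n, m, lam = 50, 10, 0.1               # n > m so im(Phi) is a proper subspace of R^n
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(m)
w_star = rng.standard_normal(n)       # stand-in for "some minimizer" w*

P = Phi @ np.linalg.pinv(Phi)         # orthogonal projector onto im(Phi)
w_tilde = P @ w_star                  # component in the span of the featurized data
u = w_star - w_tilde                  # component in ker(Phi^T)

print(np.allclose(Phi.T @ u, 0))      # Phi^T u = 0
print(np.isclose(w_tilde @ u, 0))     # the two components are orthogonal
print(J(w_tilde, Phi, y, lam) <= J(w_star, Phi, y, lam))  # J(w_tilde) <= J(w*)
```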

## 3 The kernel trick

Given the above results, we may choose to minimize $$J (w)$$ exclusively over $$w \in \operatorname{im} (\Phi)$$. This means we can write $$w = \Phi v$$ for some $$v \in \mathbb{R}^m$$, and substitute to have $$J$$ as a function of $$v$$:

 $$\displaystyle J (v) = L (\Phi^T \Phi v, \boldsymbol{y}) + \lambda \| \Phi v \|^2 = L (Kv, \boldsymbol{y}) + \lambda v^T Kv$$ (2)

where $$K = \Phi^T \Phi \in \mathbb{R}^{m \times m}$$.

In some situations our feature dimension $$n$$ is much larger than the number of training data points $$m$$. Assuming $$n \gg m$$, optimizing over $$v \in \mathbb{R}^m$$ (Eq. 2) can be much more computationally convenient than optimizing over $$w \in \mathbb{R}^n$$ (Eq. 1).
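A quick sketch of this equivalence: with $$w = \Phi v$$, Eq. 1 evaluated at $$\Phi v$$ agrees with Eq. 2 evaluated at $$v$$, and the latter touches the data only through the $$m \times m$$ matrix $$K$$. The sizes below are arbitrary, chosen so that $$n \gg m$$, and the MSE loss is used for concreteness:

```python
import numpy as np

# Eq. 1 evaluated at w = Phi v versus Eq. 2 evaluated at v.
def J_primal(w, Phi, y, lam):
    return np.mean((Phi.T @ w - y) ** 2) + lam * np.dot(w, w)

def J_kernel(v, K, y, lam):
    return np.mean((K @ v - y) ** 2) + lam * v @ K @ v

rng = np.random.default_rng(2)
n, m, lam = 500, 20, 0.1            # n >> m: many features, few data points
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(m)
K = Phi.T @ Phi                     # m x m Gram matrix

v = rng.standard_normal(m)
print(np.isclose(J_primal(Phi @ v, Phi, y, lam), J_kernel(v, K, y, lam)))  # True
```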

Additionally, if $$n$$ is large then even forming and storing $$\phi (x) \in \mathbb{R}^n$$ is expensive. Fortunately, if we precompute the matrix $$K = \Phi^T \Phi$$ then we never have to deal with $$\phi$$ directly during training, since Eq. 2 is only in terms of $$K$$. And for certain choices of $$\phi$$, we can compute $$K$$ without ever evaluating $$\phi (x)$$ either, which is the essence of the kernel trick. Notice that:

 $$\displaystyle K_{i j} = \phi (x^i)^T \phi (x^j)$$ (3)

We look for a kernel function $$\boldsymbol{K}$$ such that $$\boldsymbol{K} (x^i, x^j) = \phi (x^i)^T \phi (x^j)$$. If evaluating $$\boldsymbol{K}$$ is cheaper than evaluating $$\phi (x)$$ and then taking dot products, we can just use $$\boldsymbol{K}$$ to form $$K$$ instead.

Suppose for example that each $$x^i \in \mathbb{R}^2$$ and $$\phi$$ produces quadratic features:

$$\displaystyle \phi ((x_1, x_2)^T) = \left( 1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2 \right)^T$$

Now if we define $$\boldsymbol{K} (x, z) = (1 + x^T z)^2$$ for any $$x, z \in \mathbb{R}^2$$, we can check by expanding that

$$\displaystyle \boldsymbol{K} (x, z) = \phi (x)^T \phi (z)$$

Notice that computing $$\boldsymbol{K} (x, z) = (1 + x^T z)^2$$ means taking a dot product between vectors in $$\mathbb{R}^2$$, while computing $$\phi (x)^T \phi (z)$$ naively requires forming and taking the dot product between two vectors in $$\mathbb{R}^6$$.
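This identity is easy to verify numerically. The sketch below implements the quadratic feature map above together with the kernel $$\boldsymbol{K} (x, z) = (1 + x^T z)^2$$ and checks that they agree on random inputs:

```python
import numpy as np

# Quadratic feature map from the example above (inputs in R^2, features in R^6).
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Corresponding kernel function: a dot product in R^2, then a square.
def k(x, z):
    return (1.0 + np.dot(x, z)) ** 2

rng = np.random.default_rng(3)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(k(x, z), phi(x) @ phi(z)))   # True up to floating point
```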

In general, if a feature map $$\phi$$ has a corresponding kernel function $$\boldsymbol{K}$$ satisfying $$\boldsymbol{K} (x, z) = \phi (x)^T \phi (z)$$ then we have the option of forming $$K = \Phi^T \Phi$$ without ever using $$\phi$$, by setting:

$$\displaystyle K_{i j} = \boldsymbol{K} (x^i, x^j)$$

This tends to be useful when $$n$$ (the dimension of $$\phi (x)$$) is extremely large, and there is an equivalent kernel $$\boldsymbol{K}$$ which makes evaluating $$\phi (x)^T \phi (z)$$ feasible.
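As a final illustration, the Gram matrix $$K$$ for a small toy dataset can be formed entry-wise from the kernel function alone, and compared against $$\Phi^T \Phi$$ built from explicit features. The dataset here is random and purely for illustration:

```python
import numpy as np

def phi(x):
    # Quadratic feature map from the example above (inputs in R^2).
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 2))              # m = 10 toy inputs in R^2

K_kernel = (1.0 + X @ X.T) ** 2               # K_ij = (1 + x_i^T x_j)^2, no phi needed
Phi = np.stack([phi(x) for x in X], axis=1)   # n x m matrix of explicit features (n = 6)
K_explicit = Phi.T @ Phi

print(np.allclose(K_kernel, K_explicit))      # True: same Gram matrix either way
```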
