Regularized linear regression as MAP


Suppose we have input vectors \(\{ x_i \}_{i = 1}^n\) and corresponding scalar targets \(\{ y_i \}_{i = 1}^n\), where each input vector \(x_i \in \mathbb{R}^d\). We can arrange the inputs as a matrix \(X\) and the targets as a vector \(y\):

\(\displaystyle X = \left( \begin{array}{ccc} {-\!\!-} & x_1^T & {-\!\!-}\\ & \vdots & \\ {-\!\!-} & x_n^T & {-\!\!-} \end{array} \right) \in \mathbb{R}^{n \times d} \quad \text{ and } \quad y = \left( \begin{array}{c} y_1\\ \vdots\\ y_n \end{array} \right) \in \mathbb{R}^n\)

Assume the conditional distribution \(y_i |x_i, w\) is normal with mean \(x_i^T w\) and known variance \(\sigma^2\). Under i.i.d assumptions on the data the likelihood is:

\(\displaystyle p (y|X, w) =\mathcal{N} (y|Xw, \sigma^2 I)\)

We will also assume an isotropic gaussian prior on the weights with known variance \(s^2\):

\(\displaystyle p (w) =\mathcal{N} (w| 0, s^2 I)\)

We can use Bayes rule to obtain the posterior distribution over the weights:

\(\displaystyle p (w|X, y) = \frac{p (y|X, w) p (w)}{\int p (y|X, w') p (w') d w'}\)

It turns out that the posterior is actually normal, with mean and covariance we can solve for exactly, though the algebra is a bit messy. Instead of trying to find the whole posterior distribution, here we just want the MAP estimate, the single value of \(w\) that maximizes the log density:

\(\displaystyle \arg \max_w F (w) = \log p (w|X, y)\)

We can expand the MAP objective:

\begin{eqnarray*} F (w) & \propto & \log p (y|X, w) + \log p (w)\\ & = & \log \mathcal{N} (y|Xw, \sigma^2 I) + \log \mathcal{N} (w| 0, s^2 I)\\ & = & - \frac{1}{2} [(y - Xw)^T (\sigma^2 I)^{- 1} (y - Xw)] - \frac{1}{2} [w^T (s^2 I)^{- 1} w] + C\\ & = & - \frac{1}{2 \sigma^2} \| y - Xw \|^2 - \frac{1}{2 s^2} \| w \|^2 + C \end{eqnarray*}

Dropping the constant and multiplying the objective by \(- 2 \sigma^2\) gives us a minimization problem:

\(\displaystyle \arg \min_w L (w) = \| y - Xw \|^2 + \left( \frac{\sigma^2}{s^2} \right) \| w \|^2\)

This objective is identical to L2-regularized least squares with regularization constant \(\lambda = \left( \frac{\sigma^2}{s^2} \right)\).