Regularized linear regression as MAP

2020/07/01

Suppose we have input vectors $$\{ x_i \}_{i = 1}^n$$ and corresponding scalar targets $$\{ y_i \}_{i = 1}^n$$, where each input vector $$x_i \in \mathbb{R}^d$$. We can arrange the inputs as a matrix $$X$$ and the targets as a vector $$y$$:

$$\displaystyle X = \left( \begin{array}{ccc} {-\!\!-} & x_1^T & {-\!\!-}\\ & \vdots & \\ {-\!\!-} & x_n^T & {-\!\!-} \end{array} \right) \in \mathbb{R}^{n \times d} \quad \text{ and } \quad y = \left( \begin{array}{c} y_1\\ \vdots\\ y_n \end{array} \right) \in \mathbb{R}^n$$

Assume the conditional distribution $$y_i |x_i, w$$ is normal with mean $$x_i^T w$$ and known variance $$\sigma^2$$. Under the assumption that the data are i.i.d., the likelihood is:

$$\displaystyle p (y|X, w) =\mathcal{N} (y|Xw, \sigma^2 I)$$
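As a quick numerical check of the i.i.d. factorization (a sketch using NumPy and SciPy; the dimensions, seed, and variable names are arbitrary), the stacked likelihood $$\mathcal{N} (y|Xw, \sigma^2 I)$$ should equal the product of the per-example densities $$\prod_i \mathcal{N} (y_i |x_i^T w, \sigma^2)$$, i.e. the log-likelihoods should sum:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n, d, sigma = 5, 3, 0.7                      # arbitrary sizes and noise std
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = X @ w + sigma * rng.normal(size=n)

# Joint log-likelihood of the stacked target vector: log N(y | Xw, sigma^2 I).
joint = multivariate_normal.logpdf(y, mean=X @ w, cov=sigma**2 * np.eye(n))

# Sum of per-example log densities: sum_i log N(y_i | x_i^T w, sigma^2).
per_point = norm.logpdf(y, loc=X @ w, scale=sigma).sum()

assert np.isclose(joint, per_point)
```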

We will also assume an isotropic Gaussian prior on the weights with known variance $$s^2$$:

$$\displaystyle p (w) =\mathcal{N} (w| 0, s^2 I)$$

We can use Bayes' rule to obtain the posterior distribution over the weights:

$$\displaystyle p (w|X, y) = \frac{p (y|X, w) p (w)}{\int p (y|X, w') p (w') d w'}$$

It turns out that the posterior is itself Gaussian, with a mean and covariance we can solve for exactly, though the algebra is a bit messy. Instead of finding the whole posterior distribution, here we just want the MAP estimate: the single value of $$w$$ that maximizes the log posterior density:

$$\displaystyle w_{\text{MAP}} = \arg \max_w F (w), \quad \text{where } F (w) = \log p (w|X, y)$$

We can expand the MAP objective:

\begin{eqnarray*} F (w) & = & \log p (y|X, w) + \log p (w) - \log p (y|X)\\ & = & \log \mathcal{N} (y|Xw, \sigma^2 I) + \log \mathcal{N} (w| 0, s^2 I) + C\\ & = & - \frac{1}{2} [(y - Xw)^T (\sigma^2 I)^{- 1} (y - Xw)] - \frac{1}{2} [w^T (s^2 I)^{- 1} w] + C\\ & = & - \frac{1}{2 \sigma^2} \| y - Xw \|^2 - \frac{1}{2 s^2} \| w \|^2 + C \end{eqnarray*}

Here $$C$$ collects all terms that do not depend on $$w$$: the evidence term $$- \log p (y|X)$$ and the normalizing constants of the two Gaussians.
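The expansion can be sanity-checked numerically: the difference between $$\log p (y|X, w) + \log p (w)$$ and the quadratic expression should be the same constant for every $$w$$. A sketch (arbitrary dimensions, seed, and variable names):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, d, sigma, s = 6, 3, 0.8, 1.5             # arbitrary sizes, noise and prior stds
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def log_joint(w):
    # log p(y | X, w) + log p(w)
    ll = multivariate_normal.logpdf(y, mean=X @ w, cov=sigma**2 * np.eye(n))
    lp = multivariate_normal.logpdf(w, mean=np.zeros(d), cov=s**2 * np.eye(d))
    return ll + lp

def quadratic(w):
    # -||y - Xw||^2 / (2 sigma^2) - ||w||^2 / (2 s^2), i.e. the expansion without C
    return -np.sum((y - X @ w)**2) / (2 * sigma**2) - np.sum(w**2) / (2 * s**2)

# The gap between the two should be the same constant at every w.
ws = rng.normal(size=(4, d))
diffs = [log_joint(w) - quadratic(w) for w in ws]
assert np.allclose(diffs, diffs[0])
```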

Dropping the constant and multiplying the objective by $$- 2 \sigma^2$$ (a negative factor, which turns the maximization into a minimization) gives:

$$\displaystyle w_{\text{MAP}} = \arg \min_w L (w), \quad \text{where } L (w) = \| y - Xw \|^2 + \left( \frac{\sigma^2}{s^2} \right) \| w \|^2$$
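Since the two objectives differ only by a constant shift and a negative scaling, they share the same optimizer. A minimal numerical sketch (arbitrary synthetic data; `scipy.optimize.minimize` with its default solver) minimizing $$-F$$ and $$L$$ separately:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, d, sigma, s = 40, 3, 0.6, 1.2            # arbitrary sizes, noise and prior stds
lam = sigma**2 / s**2
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + sigma * rng.normal(size=n)

def neg_log_posterior(w):
    # -F(w) up to an additive constant (the negated expanded MAP objective)
    return np.sum((y - X @ w)**2) / (2 * sigma**2) + np.sum(w**2) / (2 * s**2)

def ridge_loss(w):
    # L(w) = ||y - Xw||^2 + lam ||w||^2
    return np.sum((y - X @ w)**2) + lam * np.sum(w**2)

w0 = np.zeros(d)
w_map = minimize(neg_log_posterior, w0).x
w_ridge = minimize(ridge_loss, w0).x
assert np.allclose(w_map, w_ridge, atol=1e-4)
```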

This objective is identical to L2-regularized least squares (ridge regression) with regularization constant $$\lambda = \frac{\sigma^2}{s^2}$$. A tighter prior (smaller $$s^2$$) or noisier observations (larger $$\sigma^2$$) both correspond to stronger regularization.
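For completeness, the minimizer of $$L (w)$$ is available in closed form as $$\hat{w} = (X^T X + \lambda I)^{- 1} X^T y$$, the standard ridge-regression solution (a known result, not derived above). A minimal sketch on synthetic data, checking that the gradient of $$L$$ vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
sigma, s = 0.5, 2.0                          # noise and prior stds (assumed known)
lam = sigma**2 / s**2                        # regularization constant lambda

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

# Closed-form minimizer of L(w) = ||y - Xw||^2 + lam ||w||^2:
#   w_map = (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Sanity check: the gradient of L, namely -2 X^T (y - Xw) + 2 lam w,
# should vanish at the minimizer.
grad = -2 * X.T @ (y - X @ w_map) + 2 * lam * w_map
assert np.allclose(grad, 0)
```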