
Linear Regression

Analytic Solution

Unlike most of the models that we will cover,

linear regression presents us with

a surprisingly easy optimization problem.

In particular, we can find the optimal parameters

(as assessed on the training data)

analytically by applying a simple formula as follows.

First, we can subsume the bias \(b\) into the parameter \(\mathbf{w}\)

by appending a column to the design matrix consisting of all 1s.

Then our prediction problem is to minimize \(\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\).
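To make this concrete, here is a minimal NumPy sketch (the matrix entries are made up purely for illustration) of appending the constant column that absorbs the bias:

```python
import numpy as np

# Hypothetical design matrix with n = 4 examples and d = 2 features.
X = np.array([[0.5, 1.2],
              [1.0, 0.3],
              [1.5, 2.1],
              [2.0, 0.8]])

# Append a column of ones so that the bias b becomes the last entry of w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
print(X_aug.shape)  # (4, 3): two features plus the constant column
```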

As long as the design matrix \(\mathbf{X}\) has full rank

(no feature is linearly dependent on the others),

there will be just one critical point on the loss surface

and it corresponds to the minimum of the loss over the entire domain.

Taking the derivative of the loss with respect to \(\mathbf{w}\)

and setting it equal to zero yields:

$$\begin{aligned}
\partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 =
2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0
\textrm{ and hence }
\mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.
\end{aligned}$$

Solving for \(\mathbf{w}\) provides us with the optimal solution

for the optimization problem.

Note that this solution

\[\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}\]

will only be unique

when the matrix \(\mathbf X^\top \mathbf X\) is invertible,

i.e., when the columns of the design matrix

are linearly independent :cite:`Golub.Van-Loan.1996`.
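The analytic solution is straightforward to compute. The sketch below generates synthetic data with hypothetical parameters and solves the normal equations; forming \((\mathbf X^\top \mathbf X)^{-1}\) explicitly is avoided, since `np.linalg.solve` (or a dedicated least-squares routine such as `np.linalg.lstsq`) is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
X_aug = np.hstack([X, np.ones((n, 1))])         # absorb the bias into w
w_true = np.array([2.0, -3.4, 1.7, 4.2])        # hypothetical parameters, last entry is b
y = X_aug @ w_true + 0.01 * rng.normal(size=n)  # targets with small additive noise

# Solve the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Equivalent, numerically more robust least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(np.allclose(w_hat, w_lstsq))  # True: both recover w_true up to the noise
```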

Linear Regression and the Normal Distribution

One way to motivate linear regression with squared loss

is to assume that observations arise from noisy measurements,

where the noise \(\epsilon\) follows the normal distribution

\(\mathcal{N}(0, \sigma^2)\):

\[y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \textrm{ where } \epsilon \sim \mathcal{N}(0, \sigma^2).\]

Thus, we can now write out the *likelihood*

of seeing a particular \(y\) for a given \(\mathbf{x}\) via

\[P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).\]

Because each noise term is drawn independently, the likelihood of the whole dataset factorizes into such per-example terms.

According to *the principle of maximum likelihood*,

the best values of parameters \(\mathbf{w}\) and \(b\) are those

that maximize the *likelihood* of the entire dataset:

\[P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).\]

The equality follows since all pairs \((\mathbf{x}^{(i)}, y^{(i)})\)

were drawn independently of each other.

Estimators chosen according to the principle of maximum likelihood

are called *maximum likelihood estimators*.
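For illustration, the following sketch (with made-up parameter values) evaluates the per-example Gaussian likelihoods and multiplies them; it also sums their logarithms, which sidesteps the numerical underflow that products of many small densities would cause:

```python
import numpy as np

def gaussian_likelihood(y, X, w, b, sigma):
    """Per-example likelihoods p(y_i | x_i) under additive Gaussian noise."""
    resid = y - (X @ w + b)
    return np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
w, b, sigma = np.array([1.0, -2.0]), 0.5, 0.3   # assumed parameter values
y = X @ w + b + sigma * rng.normal(size=5)

per_example = gaussian_likelihood(y, X, w, b, sigma)
print(per_example.prod())         # P(y | X): product over independent examples
print(np.log(per_example).sum())  # log-likelihood: a sum instead of a product
```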

While maximizing the product of many exponential functions might look difficult,

we can simplify things significantly, without changing the objective,

by maximizing the logarithm of the likelihood instead.

For historical reasons, optimizations are more often expressed

as minimization rather than maximization.

So, without changing anything,

we can *minimize the negative log-likelihood*,

which we can express as follows:

\[-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.\]

If we assume that \(\sigma\) is fixed,

we can ignore the first term,

because it does not depend on \(\mathbf{w}\) or \(b\).

The second term is identical

to the squared error loss introduced earlier,

except for the multiplicative constant \(\frac{1}{\sigma^2}\).

Fortunately, the solution does not depend on \(\sigma\) either.

It follows that minimizing the mean squared error

is equivalent to the maximum likelihood estimation

of a linear model under the assumption of additive Gaussian noise.
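One way to check this equivalence numerically, assuming synthetic data and a fixed, known \(\sigma\), is to minimize the negative log-likelihood with a generic optimizer and compare the result with the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, sigma = 200, 0.5
X = rng.normal(size=(n, 2))
w_true, b_true = np.array([2.0, -1.0]), 0.7       # hypothetical parameters
y = X @ w_true + b_true + sigma * rng.normal(size=n)

X_aug = np.hstack([X, np.ones((n, 1))])           # absorb the bias into w

def neg_log_likelihood(params):
    resid = y - X_aug @ params
    return n / 2 * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

# Minimizing the negative log-likelihood numerically ...
w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
# ... recovers the least-squares solution (up to optimizer tolerance).
w_ls, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(np.allclose(w_mle, w_ls, atol=1e-4))        # True
```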