Linear Least Squares

Problem Statement

Consider the problem of estimating a constant but unknown vector \(\mathbf{x} \in \mathbb{R}^n\) from \(m\) measurements \(\mathbf{y} \in \mathbb{R}^m\), where \(m > n\). Assume that the measurement model is linear:

\[ \mathbf{y} = \mathbf{H} \mathbf{x} + \boldsymbol{v} \label{least_squares_main}, \]

where \(\mathbf{H} \in \mathbb{R}^{m \times n}\) is the linear observation matrix and \(\boldsymbol{v} \in \mathbb{R}^m\) is the measurement noise.

The measurement residual is defined as:

\[ \boldsymbol{\epsilon}_{\mathbf{y}} = \mathbf{y} - \mathbf{H} \hat{\mathbf{x}}, \]

where \(\hat{\mathbf{x}}\) is the "best" estimate of \(\mathbf{x}\).
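
To make this concrete, here is a minimal NumPy sketch that simulates the measurement model and evaluates the residual for a candidate estimate. The sizes, the observation matrix, the true \(\mathbf{x}\), and the noise level are all made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 20, 3                       # m measurements, n unknowns (illustrative sizes)
H = rng.standard_normal((m, n))    # made-up observation matrix
x_true = np.array([1.0, -2.0, 0.5])
v = 0.1 * rng.standard_normal(m)   # measurement noise

y = H @ x_true + v                 # linear measurement model: y = Hx + v

x_hat = np.zeros(n)                # a (poor) candidate estimate
eps_y = y - H @ x_hat              # measurement residual
print(np.linalg.norm(eps_y))
```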

Solving Linear Least Squares

The best estimate \(\hat{\mathbf{x}}\) is the vector that minimizes the sum of squared differences between the observed values \(\mathbf{y}\) and the predicted values \(\mathbf{H} \hat{\mathbf{x}}\). Hence, the least squares problem can be formulated as the optimization problem:

\[ \begin{align} \arg \min_{\hat{\mathbf{x}}} \mathbf{J} &= \arg \min_{\hat{\mathbf{x}}} \quad || \mathbf{H} \hat{\mathbf{x}} - \mathbf{y} ||^2_2 \\ &= \arg \min_{\hat{\mathbf{x}}} \quad \sum^m_{i = 1} \left( \mathbf{h}^T_i \hat{\mathbf{x}} - y_i \right)^2, \end{align} \]

where \(\mathbf{h}^T_i\) denotes the \(i\)-th row of \(\mathbf{H}\).

The cost function \(\mathbf{J}\) can be rewritten as:

\[ \begin{align} \mathbf{J} &= \epsilon^2_{y_1} + \ldots + \epsilon^2_{y_m} = \boldsymbol{\epsilon}^T_{\mathbf{y}} \boldsymbol{\epsilon}_{\mathbf{y}} \\ &= (\mathbf{y} - \mathbf{H} \hat{\mathbf{x}})^T (\mathbf{y} - \mathbf{H} \hat{\mathbf{x}}) \\ &= (\mathbf{y}^T - \hat{\mathbf{x}}^T \mathbf{H}^T) (\mathbf{y} - \mathbf{H} \hat{\mathbf{x}}) \\ &= \mathbf{y}^T \mathbf{y} - \hat{\mathbf{x}}^T \mathbf{H}^T \mathbf{y} - \mathbf{y}^T \mathbf{H} \hat{\mathbf{x}} + \hat{\mathbf{x}}^T \mathbf{H}^T \mathbf{H} \hat{\mathbf{x}} \end{align} \]

or equivalently:

\[ \begin{align} &\arg \min_{\hat{\mathbf{x}}} \quad || \mathbf{H} \hat{\mathbf{x}} - \mathbf{y} ||^2_2 = \arg \min_{\hat{\mathbf{x}}} \quad \sum^m_{i = 1} \left( \mathbf{h}^T_i \hat{\mathbf{x}} - y_i \right)^2 \Rightarrow \\ &\arg \min_{\hat{\mathbf{x}}} \quad \left( \mathbf{H} \hat{\mathbf{x}} - \mathbf{y} \right)^T \left( \mathbf{H} \hat{\mathbf{x}} - \mathbf{y} \right) = \arg \min_{\hat{\mathbf{x}}} \quad \hat{\mathbf{x}}^T \mathbf{H}^T \mathbf{H} \hat{\mathbf{x}} - 2 \left( \mathbf{H}^T \mathbf{y} \right)^T \hat{\mathbf{x}} + \mathbf{y}^T \mathbf{y}. \end{align} \]

The optimization variable is \(\hat{\mathbf{x}}\), and \(\mathbf{J}\) is a convex quadratic function of \(\hat{\mathbf{x}}\). Hence, the optimality condition is obtained by setting the gradient to zero:

\[ \begin{align} \nabla_{\hat{\mathbf{x}}} \mathbf{J} &= \nabla_{\hat{\mathbf{x}}} \left( \mathbf{H} \hat{\mathbf{x}} - \mathbf{y} \right)^T \left( \mathbf{H} \hat{\mathbf{x}} - \mathbf{y} \right) = 0 \\ &\Rightarrow 2 \mathbf{H}^T \mathbf{H} \hat{\mathbf{x}} - 2 \mathbf{H}^T \mathbf{y} = 0 \\ &\Rightarrow \mathbf{H}^T \mathbf{H} \hat{\mathbf{x}} = \mathbf{H}^T \mathbf{y}. \label{normal_equation} \end{align} \]

Equation (\(\ref{normal_equation}\)) is called the normal equation. Solving the least squares problem is equivalent to solving the normal equation. If \(\mathbf{H}\) has full column rank, then \(\mathbf{H}^T \mathbf{H}\) is invertible. Hence, the optimal solution to the normal equation (and equivalently to the least squares problem) is:

\[ \hat{\mathbf{x}} = \left( \mathbf{H}^T \mathbf{H} \right)^{-1} \mathbf{H}^T \mathbf{y}. \label{pseudoinverse} \]
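
As a sanity check, the normal equation can be solved numerically; the sketch below continues the synthetic example above and solves the linear system rather than forming the explicit inverse:

```python
# Solve (H^T H) x = H^T y; preferable to computing the inverse explicitly.
x_hat = np.linalg.solve(H.T @ H, H.T @ y)
print(x_hat)   # close to x_true when the noise is small
```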

The matrix on the right-hand side of equation (\(\ref{pseudoinverse}\)) is called the Moore-Penrose pseudoinverse of \(\mathbf{H}\) and is defined as:

\[ \mathbf{H}^{+} = \left( \mathbf{H}^T \mathbf{H} \right)^{-1} \mathbf{H}^T. \]

The solution to the least squares problem is then \(\hat{\mathbf{x}} = \mathbf{H}^+ \mathbf{y}\). If \(\mathbf{H}\) is an invertible square matrix, then \(\mathbf{H}^+ = \mathbf{H}^{-1} \left( \mathbf{H}^T \right)^{-1} \mathbf{H}^T = \mathbf{H}^{-1}\).
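
NumPy exposes the Moore-Penrose pseudoinverse as `np.linalg.pinv`; continuing the example, it recovers the same estimate:

```python
x_pinv = np.linalg.pinv(H) @ y       # x_hat = H^+ y
assert np.allclose(x_pinv, x_hat)    # matches the normal-equation solution
```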

Rather than inverting \(\mathbf{H}^T \mathbf{H}\) directly, it is numerically preferable to use the (thin) SVD \(\mathbf{H} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T\). Since \(\mathbf{H}^T \mathbf{H} = \mathbf{V} \boldsymbol{\Sigma}^2 \mathbf{V}^T\), we have \(\left ( \mathbf{H}^T \mathbf{H} \right)^{-1} = \mathbf{V} \boldsymbol{\Sigma}^{-2} \mathbf{V}^T\). Then:

\[ \left( \mathbf{H}^T \mathbf{H} \right)^{-1} \mathbf{H}^T = \mathbf{V} \boldsymbol{\Sigma}^{-2} \mathbf{V}^T \mathbf{V} \boldsymbol{\Sigma} \mathbf{U}^T = \mathbf{V} \boldsymbol{\Sigma}^{-1} \mathbf{U}^T. \]

Hence, the optimal solution is:

\[ \hat{\mathbf{x}} = \mathbf{V} \boldsymbol{\Sigma}^{-1} \mathbf{U}^T \mathbf{y}. \]
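
The same estimate can be computed from the thin SVD directly; a sketch continuing the example above:

```python
# Thin SVD: H = U @ np.diag(S) @ Vt, with U (m x n), S (n,), Vt (n x n).
U, S, Vt = np.linalg.svd(H, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / S)       # V Sigma^{-1} U^T y
assert np.allclose(x_svd, x_hat)
```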

Example - Polynomial Fitting

Suppose we have \(m\) pairs of data points \((a_1, \ b_1), \ldots, (a_m, \ b_m)\) and we'd like to fit a cubic polynomial:

\[ p(a) = b = x_0 + x_1 a + x_2 a^2 + x_3 a^3. \]

The problem can be solved via the least squares estimator. Let \(\mathbf{x} = \left[\begin{array}{ccc} x_0 & \ldots & x_3 \end{array} \right]^T\) be the vector of coefficients of \(p(a)\). The residual can be written as:

\[ \begin{align} r(\mathbf{x}) &= \left[ \begin{array}{c} p(a_1) - b_1 \\ \vdots \\ p(a_m) - b_m \end{array} \right] = \left[ \begin{array}{c} p(a_1) \\ \vdots \\ p(a_m) \end{array} \right] - \left[ \begin{array}{c} b_1 \\ \vdots \\ b_m \end{array} \right] = \left[ \begin{array}{c} x_0 + x_1 a_1 + x_2 a^2_1 + x_3 a^3_1 \\ \vdots \\ x_0 + x_1 a_m + x_2 a^2_m + x_3 a^3_m \end{array} \right] - \left[ \begin{array}{c} b_1 \\ \vdots \\ b_m \end{array} \right] \\ &= \left[ \begin{array}{cccc} 1 & a_1 & a^2_1 & a^3_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & a_m & a^2_m & a^3_m \end{array} \right] \left[ \begin{array}{c} x_0 \\ x_1 \\ x_2 \\ x_3 \end{array} \right] - \left[ \begin{array}{c} b_1 \\ \vdots \\ b_m \end{array} \right] \\ &= \mathbf{A} \mathbf{x} - \mathbf{b}. \end{align} \]

The matrix \(\mathbf{A}\) is called the Vandermonde matrix. Minimizing \(|| \mathbf{A} \mathbf{x} - \mathbf{b} ||^2_2\) then yields the coefficients \(\hat{\mathbf{x}} = \mathbf{A}^+ \mathbf{b}\).
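
A minimal, self-contained sketch of the cubic fit with made-up data points; `np.vander` with `increasing=True` builds \(\mathbf{A}\) in the column order used above, and `np.linalg.lstsq` solves the least squares problem:

```python
import numpy as np

rng = np.random.default_rng(1)

a = np.linspace(-1.0, 1.0, 30)                  # made-up sample points
x_true = np.array([0.5, -1.0, 2.0, 3.0])        # true coefficients x_0, ..., x_3
A = np.vander(a, 4, increasing=True)            # columns: 1, a, a^2, a^3
b = A @ x_true + 0.05 * rng.standard_normal(a.size)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_hat)                                    # close to x_true
```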