Least Squares

When there is no exact solution, find the best approximation

When Ax = b Has No Solution

Throughout linear algebra, we have been solving $A\vec{x} = \vec{b}$. But what if no solution exists? This happens when $\vec{b}$ does not lie in the column space of $A$.

The most common case is an overdetermined system: more equations than unknowns. You cannot draw a straight line through three random points—you have 3 equations but only 2 parameters (slope and intercept).

Overdetermined: Three points, two parameters

Three points, but we want a line y = mx + b (only 2 parameters). No single line passes through all three—the system is overdetermined.
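
To make the setup concrete, here is a minimal sketch (NumPy, with three made-up points) that builds the 3×2 system for the line and confirms it has no exact solution:

```python
import numpy as np

# Hypothetical data: three non-collinear points (x_i, y_i).
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([1.0, 3.0, 4.0])

# Each point gives one equation m*x_i + b = y_i, so the unknowns are (m, b).
# Stacking the equations gives a 3x2 system  A @ [m, b] = ys.
A = np.column_stack([xs, np.ones_like(xs)])

# ys is not in the column space of A: appending it as a column raises the rank.
print(np.linalg.matrix_rank(A))                         # 2
print(np.linalg.matrix_rank(np.column_stack([A, ys])))  # 3, so no exact solution
```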

When no exact solution exists, we settle for the next best thing: find the $\vec{x}$ that makes $A\vec{x}$ as close as possible to $\vec{b}$.

Minimizing the Error

The least squares solution $\hat{x}$ minimizes the squared length of the residual vector:

$$\hat{x} = \arg\min_{\vec{x}} \|A\vec{x} - \vec{b}\|^2$$

Why squared? Unlike the norm itself, the squared norm is differentiable everywhere, so calculus gives a clean closed-form solution. It also heavily penalizes large errors, which is often desirable.
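
A sketch of this minimization on the same hypothetical points: `np.linalg.lstsq` returns the $\hat{x}$ that minimizes the squared residual norm.

```python
import numpy as np

# Same hypothetical points as before.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([1.0, 3.0, 4.0])
A = np.column_stack([xs, np.ones_like(xs)])

# lstsq minimizes ||A @ params - ys||^2 over params = (m, b).
params, sq_resid, rank, _ = np.linalg.lstsq(A, ys, rcond=None)
m, b = params
print(f"best-fit line: y = {m:.2f}x + {b:.2f}")
print(f"sum of squared residuals: {sq_resid[0]:.3f}")
```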

Interactive: Drag points to see the best-fit line

Drag the points to see how the best-fit line adjusts. The red dashed lines show the residuals, the vertical distances from the points to the line; least squares finds the line that minimizes the sum of squared residuals.

The Geometric View: Projection

Here is the key insight: since $A\vec{x}$ must lie in the column space of $A$, the closest we can get to $\vec{b}$ is its orthogonal projection onto that column space.

The residual $\vec{b} - A\hat{x}$ must be perpendicular to the column space. This means it is perpendicular to every column of $A$:

$$A^T(\vec{b} - A\hat{x}) = \vec{0}$$
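
Numerically, this condition is easy to verify for the running example; $A^T$ applied to the residual stacks its dot products with the columns of $A$:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 3.0, 4.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = b - A @ x_hat

# Each entry is the dot product of the residual with one column of A;
# at the least squares solution all of them vanish (up to round-off).
print(A.T @ residual)  # approximately [0, 0]
```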

3D view: Project b onto the column space

The column space (blue plane) contains all possible $A\vec{x}$. The least squares solution finds the point in this plane closest to $\vec{b}$: its orthogonal projection, shown as the green point. The orange vector is the residual, which is perpendicular to the plane.

The Normal Equations

Rearranging $A^T(\vec{b} - A\hat{x}) = \vec{0}$ gives the normal equations:

$$A^TA\hat{x} = A^T\vec{b}$$

If $A^TA$ is invertible (which happens when $A$ has full column rank), the solution is:

$$\hat{x} = (A^TA)^{-1}A^T\vec{b}$$

The matrix $(A^TA)^{-1}A^T$ is called the pseudoinverse of $A$. It generalizes the inverse to non-square matrices.
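
For the running example, here is a sketch that solves the normal equations directly and checks the result against NumPy's SVD-based pseudoinverse, `np.linalg.pinv`:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 3.0, 4.0])

# Normal equations: (A^T A) x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# With full column rank, (A^T A)^{-1} A^T equals the Moore-Penrose pseudoinverse.
x_hat_pinv = np.linalg.pinv(A) @ b

print(x_hat)
print(np.allclose(x_hat, x_hat_pinv))  # True
```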

Interactive: Solve the normal equations

The solution $\hat{x}$ minimizes $\|A\vec{x} - \vec{b}\|$. The residual (red dashed line) is perpendicular to the column space.

Connection to Orthogonality

The normal equations encode the orthogonality condition. Multiplying the residual by $A^T$ computes its dot product with each column of $A$. When all of those dot products are zero, the residual is perpendicular to the column space.

This connects back to our study of orthogonal projections. The projection matrix onto the column space of $A$ is:

$$P = A(A^TA)^{-1}A^T$$

Applying $P$ to $\vec{b}$ gives the projected point $A\hat{x}$, the closest point in the column space.
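
A sketch that forms $P$ explicitly for the running example (fine for a tiny matrix, though rarely done at scale) and checks the expected properties:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 3.0, 4.0])

# Projection matrix onto the column space (A is assumed to have full column rank).
P = A @ np.linalg.inv(A.T @ A) @ A.T
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

print(np.allclose(P @ b, A @ x_hat))  # P b is the projected point A x_hat
print(np.allclose(P @ P, P))          # projecting twice changes nothing (P^2 = P)
print(np.allclose(P.T, P))            # orthogonal projections are symmetric
```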

Applications

Least squares is one of the most widely used techniques in applied mathematics:

Linear Regression

Fit a line (or hyperplane) to noisy data. Each data point gives one equation, but we have fewer parameters than points.

Example: predicting house prices from square footage, number of bedrooms, and other features.

Whenever you have more measurements than parameters and want to find the best fit, least squares provides the answer. It is the mathematical foundation of regression, filtering, and estimation.
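
To illustrate, here is a sketch of multivariate linear regression on synthetic, entirely made-up "house" data (the feature names and coefficients are invented for the example); each row of the design matrix is one measurement:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic features: square footage and bedroom count for n imaginary houses.
sqft = rng.uniform(500, 3000, n)
beds = rng.integers(1, 6, n).astype(float)

# Invented "true" pricing model plus noise, used only to generate the data.
price = 150.0 * sqft + 10_000.0 * beds + 50_000.0 + rng.normal(0, 20_000, n)

# Design matrix: one column per parameter, plus a column of ones for the intercept.
X = np.column_stack([sqft, beds, np.ones(n)])

coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coef)  # should land near the invented values [150, 10000, 50000]
```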

Numerical Considerations

While the normal equations are elegant, explicitly forming $A^TA$ and computing $(A^TA)^{-1}$ squares the condition number of $A$ and can be numerically unstable. In practice, better methods exist:

  • QR decomposition: Decompose $A = QR$, then solve $R\hat{x} = Q^T\vec{b}$
  • SVD: Use the pseudoinverse built from the singular value decomposition

These methods are more stable when $A$ is nearly rank-deficient or ill-conditioned; a brief sketch of both appears below.
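
A minimal sketch of both routes on the running example (NumPy only; in practice a dedicated triangular solver would be used for the $R$ system):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 3.0, 4.0])

# QR route: A = QR with orthonormal Q and upper-triangular R, then R x_hat = Q^T b.
Q, R = np.linalg.qr(A)              # reduced QR by default
x_qr = np.linalg.solve(R, Q.T @ b)  # a triangular solver could exploit R's structure

# SVD route: pseudoinverse built from the singular value decomposition.
x_svd = np.linalg.pinv(A) @ b

print(np.allclose(x_qr, x_svd))  # True
```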

Key Takeaways

  • When $A\vec{x} = \vec{b}$ has no solution, we minimize $\|A\vec{x} - \vec{b}\|^2$
  • The least squares fit $A\hat{x}$ is the projection of $\vec{b}$ onto the column space of $A$
  • The residual is perpendicular to the column space
  • The normal equations: $A^TA\hat{x} = A^T\vec{b}$
  • Solution: $\hat{x} = (A^TA)^{-1}A^T\vec{b}$ (when $A$ has full column rank)
  • Applications: regression, signal processing, GPS, computer vision, and countless other fields