Linear Regression
Fitting a line to data by minimizing the sum of squared residuals.
Concept Overview
Linear regression is the foundation of statistical modeling and machine learning. It models the relationship between a dependent variable y and one or more independent variables x by fitting a straight line that minimizes the sum of squared differences between observed and predicted values. Despite its simplicity, linear regression remains one of the most widely used and interpretable models in science, engineering, and business.
Mathematical Definition
For simple linear regression with one predictor, the model is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is a random error term.
Ordinary Least Squares (OLS)
OLS finds the parameters that minimize the sum of squared residuals:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

Setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero yields the closed-form estimates:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
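The closed-form OLS slope and intercept can be sketched in a few lines of NumPy (a minimal illustration, not the interactive module's actual implementation):

```python
import numpy as np

def ols_fit(x, y):
    """Fit y = b0 + b1*x by ordinary least squares (closed form)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    # Slope: covariance of x and y over variance of x.
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # Intercept: the fitted line passes through (xbar, ybar).
    b0 = ybar - b1 * xbar
    return b0, b1

# Noiseless data generated from y = 2 + 3x should be recovered exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
b0, b1 = ols_fit(x, y)
```

With noisy data the recovered coefficients scatter around the true values; the noiseless case is used here only so the answer is unambiguous.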
Key Concepts
R² (Coefficient of Determination)
R² measures the proportion of variance in y explained by the model:

$$R^2 = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
R² ranges from 0 to 1. A value of 1 means a perfect fit; 0 means the model explains no variance beyond the mean. In the interactive, try increasing noise to see R² drop, or adding more data points to see the fit stabilize.
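The definition translates directly into code. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total variation about the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
assert r_squared(y, y) == 1.0                       # perfect fit
assert r_squared(y, np.full(4, y.mean())) == 0.0    # mean-only model
```

The two assertions mark the endpoints described above: predicting every point exactly gives R² = 1, while predicting the mean everywhere gives R² = 0.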
Residuals
Residuals are the vertical distances between each data point and the fitted line. OLS minimizes the sum of their squares — not their absolute values — which gives more weight to outliers. Toggle "Residuals" in the interactive to visualize these distances. Well-behaved residuals should appear randomly scattered with no visible pattern.
Assumptions
- Linearity: The true relationship between x and y is linear. Violations produce systematic patterns in residuals.
- Homoscedasticity: Residual variance is constant across all x values. Fan-shaped residuals indicate violation.
- Independence: Observations are independent of each other. Violated in time series without accounting for autocorrelation.
- Normality: Residuals are normally distributed. Less critical for large samples due to the Central Limit Theorem.
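One useful numerical fact behind these diagnostics: by the OLS normal equations, the residuals of a fit with an intercept sum to zero and are orthogonal to the predictor, so any visible pattern must come from assumption violations rather than the fitting procedure itself. A minimal sketch (fitting via `np.polyfit`; the data-generating numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, x.size)   # linear truth + noise

b1, b0 = np.polyfit(x, y, 1)       # returns highest degree first: slope, intercept
residuals = y - (b0 + b1 * x)

# Normal equations: residuals sum to ~0 and are orthogonal to x.
assert abs(residuals.sum()) < 1e-8
assert abs(np.dot(residuals, x)) < 1e-6
```

Plotting `residuals` against `x` is the practical check: a random band suggests the assumptions hold, while a curve indicates nonlinearity and a widening fan indicates heteroscedasticity.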
Multiple Linear Regression
The model extends naturally to multiple predictors:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
In matrix form, with design matrix $X$, the OLS solution is $\hat{\beta} = (X^\top X)^{-1} X^\top y$. With multiple predictors, multicollinearity (correlated predictors) can inflate coefficient variance. Regularization methods such as Ridge (L2) and Lasso (L1) address this by adding penalty terms to the objective.
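A compact NumPy sketch of the matrix-form solution and its Ridge variant (the closed Ridge form exists because the L2 penalty keeps the problem quadratic; the penalty strength `alpha` here is an arbitrary illustrative choice):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares solution of y ~ X beta (no explicit matrix inversion)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_ridge(X, y, alpha=1.0):
    """Closed-form Ridge: beta = (X^T X + alpha I)^-1 X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Two predictors plus an intercept column of ones.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0.0, 0.1, 50)
beta_ols = fit_ols(X, y)
```

`np.linalg.lstsq` is preferred over literally inverting $X^\top X$: it is numerically stabler, which matters precisely when predictors are nearly collinear. Setting `alpha=0.0` in `fit_ridge` recovers the OLS estimate.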
Historical Context
The method of least squares was independently developed by Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1809). Gauss used it to predict the orbit of the asteroid Ceres from limited observations — a dramatic public demonstration that established the technique's reputation. Francis Galton introduced the term "regression" in the 1880s while studying the tendency of children's heights to "regress" toward the population mean.
Today, linear regression remains the starting point for nearly every data analysis. Its interpretability — each coefficient directly measures the effect of a predictor — makes it indispensable in fields where understanding causation matters, from epidemiology to economics.
Real-world Applications
- Economics: Modeling the relationship between GDP, inflation, unemployment, and other macroeconomic variables.
- Medicine: Dose-response modeling, predicting patient outcomes from clinical measurements.
- Real estate: Predicting house prices from features like square footage, location, and number of bedrooms.
- Climate science: Estimating trends in temperature, sea level, and CO2 concentrations over time.
- Machine learning: Linear regression is a building block — logistic regression, neural networks, and regularized models all extend its core idea.
Related Concepts
- Gradient Descent — an alternative to the closed-form OLS solution, used when the dataset is too large for matrix inversion
- Central Limit Theorem — guarantees that OLS estimators are approximately normally distributed in large samples, enabling confidence intervals and hypothesis tests
- Eigenvalues & Eigenvectors — the eigenvalues of $X^\top X$ determine the stability of the OLS solution; near-zero eigenvalues indicate multicollinearity
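The gradient-descent alternative mentioned above can be sketched in a few lines: instead of solving the normal equations, repeatedly step against the gradient of the mean squared error (the learning rate and iteration count below are illustrative assumptions):

```python
import numpy as np

def gd_ols(X, y, lr=0.01, steps=5000):
    """Minimize mean squared error ||y - X beta||^2 / n by batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        grad = -2.0 / n * X.T @ (y - X @ beta)   # gradient of the MSE
        beta -= lr * grad
    return beta

# Noiseless data: gradient descent should converge to the exact coefficients.
X = np.column_stack([np.ones(100), np.linspace(-1.0, 1.0, 100)])
y = X @ np.array([0.5, 2.0])
beta = gd_ols(X, y)
```

On a dataset this small the closed form is faster, but the iterative version needs only matrix-vector products, which is why it scales to problems where forming or factoring $X^\top X$ is impractical.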
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Linear Regression module.
Try Linear Regression on Riano →