Linear Regression Calculator

Calculate simple linear regression to model the relationship between two variables. Get the regression equation, statistical measures, and make predictions.

Y = a + bX where b = sum[(x-xbar)(y-ybar)]/sum(x-xbar)^2, a = ybar - b*xbar; R^2 = r^2; SE = sqrt[sum(y-yhat)^2/(n-2)]; t = b/SE_b
X: 1,2,3,4,5 and Y: 2,4,5,4,6 -> Y = 1.8 + 0.8X, R^2 = 72.7%, slope t = 2.83 (df = 3, p ~= 0.07; not significant at the 5% level with only 5 points). Predict X=6 -> Y ~= 6.6
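The fit above can be reproduced with a few lines of Python. This is a minimal sketch of the least-squares formulas, not the calculator's own code; the helper name fit_line is ours.

```python
# Least-squares fit of the worked example (X = 1..5, Y = 2,4,5,4,6)
# using b = sum[(x-xbar)(y-ybar)] / sum(x-xbar)^2 and a = ybar - b*xbar.
def fit_line(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(round(a, 2), round(b, 2))  # -> 1.8 0.8
```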

What is linear regression and when should I use it?

Linear regression models the relationship between one independent variable (X) and one dependent variable (Y) using a straight line: Y = a + bX. Use when: (1) Predicting continuous outcomes (sales, price, weight), (2) Quantifying the effect of X on Y, (3) The relationship is approximately linear. Example: Predict house price from size. Size=1500 sqft → Price=$200k. Components: Slope (b) = change in Y per unit X. Intercept (a) = Y when X=0. Not for: categorical outcomes or non-linear relationships (use logistic or polynomial regression instead).

How do I interpret the slope and intercept?

SLOPE (b): Change in Y for each 1-unit increase in X. Example: Salary = 30,000 + 5,000*Years. Slope=5,000 means each year of experience adds $5,000 in salary. Positive slope = positive relationship, negative = inverse. INTERCEPT (a): Value of Y when X=0. In the example, a=30,000 is the starting salary with 0 years of experience. Warning: The intercept may be meaningless if X=0 is outside the data range. Example: Weight = -100 + 3*Height. An intercept of -100 lbs at 0 inches is nonsensical (extrapolation error).
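The slope and intercept interpretations can be checked directly. A tiny sketch of the salary example above (the function name salary is ours):

```python
# Hypothetical model from the text: Salary = 30,000 + 5,000 * Years.
def salary(years):
    return 30_000 + 5_000 * years

print(salary(0))              # -> 30000  (the intercept: salary at 0 years)
print(salary(4) - salary(3))  # -> 5000   (the slope: raise per extra year)
```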

What is R-squared and how do I interpret it?

R^2 = proportion of variance in Y explained by X. Ranges 0-1 (or 0-100%). R^2=0.75 means X explains 75% of Y variation; 25% due to other factors. Interpretation: <0.3 weak model, 0.3-0.7 moderate, >0.7 strong. Example: Height predicts weight, R^2=0.64. Height explains 64% of weight differences; 36% from diet, muscle, bone density. Warning: High R^2 doesn't mean causation or good predictions. Low R^2 can still be useful (stock prediction R^2=0.05 is valuable!). Adjusted R^2 better for comparing models.
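R^2 is easy to compute by hand as 1 - SS_res/SS_tot. A sketch on the five-point dataset from the top of the page (variable names are ours):

```python
# R^2 = 1 - (residual sum of squares) / (total sum of squares).
xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
ss_tot = sum((y - ybar) ** 2 for y in ys)                      # total variation in Y
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # variation left unexplained
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # -> 0.727
```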

How do I make predictions using the regression equation?

Use the equation Y = a + bX: plug in an X value to predict Y. Example: Sales = 1000 + 50*Ads. If Ads=20 → Sales = 1000 + 50(20) = 2000 units. Confidence interval: Ŷ +/- t*SE, where SE grows with distance from the mean of X (further = wider interval). Extrapolation risk: Don't predict beyond the data range. If the data has Ads 10-30, predicting at Ads=100 is unreliable (the relationship may change). Always report: point estimate, confidence interval, R^2, sample size. Cross-validate predictions on new data.
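Plugging into the equation is one line of code; the useful extra is a guard against extrapolation. A sketch using the ad-spend example above (the function name predict_sales and the range check are ours):

```python
# Point prediction for Sales = 1000 + 50*Ads, with a crude extrapolation
# guard assuming observed Ads values between 10 and 30.
def predict_sales(ads, a=1000, b=50, lo=10, hi=30):
    if not lo <= ads <= hi:
        raise ValueError("Ads outside observed range: extrapolation is unreliable")
    return a + b * ads

print(predict_sales(20))  # -> 2000
try:
    predict_sales(100)
except ValueError as e:
    print("blocked:", e)
```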

What are the assumptions of linear regression?

LINEAR: Relationship must be linear (check scatter plot). INDEPENDENCE: Observations independent (no autocorrelation). HOMOSCEDASTICITY: Constant variance of residuals (check residual plot). NORMALITY: Residuals normally distributed (for inference, not prediction). NO OUTLIERS: Extreme values distort results. Violations: Linear → try polynomial/log transformation. Homoscedasticity → weighted regression. Normality → bootstrap, robust methods. Always plot: scatter plot (linearity), residual plot (patterns), Q-Q plot (normality), leverage plot (outliers).
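Most of these checks start from the residuals. A minimal numeric sketch on the five-point dataset from the top of the page, assuming a = 1.8 and b = 0.8 (the least-squares values for that data); in practice you would plot these rather than print them:

```python
# Residuals = observed - predicted; OLS residuals always sum to ~0.
xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
a, b = 1.8, 0.8                      # least-squares fit for this data
res = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(r, 1) for r in res])    # -> [-0.6, 0.6, 0.8, -1.0, 0.2]
print(round(abs(sum(res)), 6))       # -> 0.0
```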

What is the standard error and why does it matter?

Standard Error (SE) measures average prediction error. Smaller SE = better fit. Example: Price = 50,000 + 100*Size, SE=10,000. Predictions typically off by +/-$10,000. Used for: (1) Confidence intervals for predictions, (2) Hypothesis tests, (3) Comparing models (lower SE = better). Related: RMSE (root mean square error) = sqrt(Σresiduals^2/n). SE of slope: SE_b = SE/sqrt(Σ(x-x̄)^2). Larger samples and tighter data (lower SD) give smaller SE. Report SE alongside R^2 for a complete picture.
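A sketch computing SE and SE_b from scratch on the five-point dataset used at the top of the page (variable names are ours):

```python
import math

# SE = sqrt(SSE / (n-2)); SE_b = SE / sqrt(sum (x - xbar)^2).
xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2))        # typical size of a residual
se_b = se / math.sqrt(sxx)           # uncertainty in the slope estimate
print(round(se, 3), round(se_b, 3))  # -> 0.894 0.283
```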

How do I test if the slope is statistically significant?

Test H₀: β=0 (no relationship) vs H₁: β≠0. T-statistic: t = b/SE_b with df=n-2. Example: b=5.2, SE_b=1.5, n=30 → t=5.2/1.5=3.47, df=28, p<0.01 (significant slope). Rule: |t|>2 roughly means p<0.05. Confidence interval: b +/- t*SE_b. If CI includes 0 → not significant. Example: CI = [2.2, 8.2] excludes 0 → significant. Common mistake: R^2 can be low but slope still significant (weak but real relationship). Or R^2 high but not significant (small sample, high variability).
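The arithmetic in the example above (b=5.2, SE_b=1.5) can be sketched directly; the rough CI here uses the |t|≈2 rule rather than the exact critical value:

```python
# t-statistic for the slope, and a rough 95% CI using t ~= 2.
b, se_b = 5.2, 1.5
t = b / se_b
ci = (round(b - 2 * se_b, 1), round(b + 2 * se_b, 1))
print(round(t, 2))  # -> 3.47  (well above 2, so p < 0.05)
print(ci)           # -> (2.2, 8.2)  (excludes 0 -> significant)
```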

What is the difference between correlation and regression?

CORRELATION: Symmetric relationship strength (r). No cause/effect, no prediction. r_xy = r_yx. REGRESSION: Asymmetric prediction (Y from X). Gives equation, distinguishes predictor/outcome. Can be different: regress Y on X ≠ regress X on Y. Example: Height-Weight r=0.7 (correlation). Predict Weight from Height: W=−100+3H (regression). Relationship: slope = r*(SD_y/SD_x). Both have same R^2. Use correlation for: exploring relationships. Use regression for: prediction, effect sizes, control variables.
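The slope = r*(SD_y/SD_x) relationship can be verified numerically. A sketch on the five-point dataset from the top of the page (variable names are ours; the SD ratio reduces to sqrt(Syy/Sxx) because the n-1 factors cancel):

```python
import math

# r is symmetric in x and y; the regression slope rescales it by SD_y/SD_x.
xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
b = r * math.sqrt(syy / sxx)     # slope = r * (SD_y / SD_x)
print(round(r, 3), round(b, 3))  # -> 0.853 0.8
```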

How do residuals help diagnose regression problems?

Residuals = observed Y - predicted Y = errors. Patterns indicate violations. RANDOM scatter = good fit. CURVED pattern = non-linear relationship (try polynomial). FUNNEL shape = heteroscedasticity (try log transformation, weighted regression). CLUSTERS = missing variables or subgroups. OUTLIERS = influential points (check Cook's distance). Example: Residuals vs X shows U-shape → quadratic term needed. Residuals vs fitted shows increasing spread → log(Y) transformation. Always plot residuals! Most problems invisible in R^2 alone.
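A quick demonstration of the curved-pattern diagnosis: fit a line to deliberately quadratic data and look at the residual signs. The +,-,...,-,+ pattern is the numeric version of the U-shape a residual plot would show (data and variable names are ours):

```python
# Fitting a straight line to y = x^2: residual signs reveal the missing
# quadratic term even without a plot.
xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]          # deliberately non-linear data
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
signs = ['+' if y - (a + b * x) > 0 else '-' for x, y in zip(xs, ys)]
print(signs)  # -> ['+', '-', '-', '-', '+']  (U-shape: positive at the ends)
```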

What is the difference between simple and multiple regression?

SIMPLE: One predictor (Y = a + bX). Example: Predict salary from experience. MULTIPLE: Multiple predictors (Y = a + b₁X₁ + b₂X₂ + ...). Example: Predict salary from experience + education + location. Benefits of multiple: (1) Control confounds, (2) Better predictions (higher R^2), (3) Partial effects (b₁ = effect of X₁ holding X₂ constant). Example: Simple regression education→salary might confound with experience. Multiple regression separates effects. Caution: collinearity (correlated predictors) causes problems.
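For two predictors, the partial slopes come from the normal equations, which a 2x2 Cramer's-rule solve handles. A sketch on synthetic data built to satisfy Y = 2 + 3*X1 + 5*X2 exactly, so the fit should recover those coefficients (data and variable names are ours):

```python
# Two-predictor OLS via normal equations on centered data (Cramer's rule).
x1 = [1, 2, 3, 4]
x2 = [0, 1, 0, 1]
y  = [5, 13, 11, 19]          # exactly 2 + 3*x1 + 5*x2
n = len(y)
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
c1 = [v - m1 for v in x1]
c2 = [v - m2 for v in x2]
cy = [v - my for v in y]
s11 = sum(v * v for v in c1)
s22 = sum(v * v for v in c2)
s12 = sum(p * q for p, q in zip(c1, c2))
s1y = sum(p * q for p, q in zip(c1, cy))
s2y = sum(p * q for p, q in zip(c2, cy))
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det   # effect of x1 holding x2 constant
b2 = (s11 * s2y - s12 * s1y) / det   # effect of x2 holding x1 constant
a = my - b1 * m1 - b2 * m2
print(round(a, 3), round(b1, 3), round(b2, 3))  # -> 2.0 3.0 5.0
```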

How do outliers and influential points affect regression?

OUTLIERS: Points far from regression line (large residuals). May indicate: data errors, special cases, or model inadequacy. INFLUENTIAL points: Points that strongly affect slope/intercept (high leverage). Detect with: Cook's distance >1, leverage >2p/n. Example: 20 points show r=0.1. Add outlier far from trend → r jumps to 0.8 (misleading!). Solutions: (1) Check for data errors, (2) Try robust regression, (3) Report with/without outlier, (4) Transform variables. Never silently remove - justify and report impact. One bad point can dominate entire analysis!
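The "one point dominates" effect is easy to reproduce. A small variant of the example above (our own data): four points with zero correlation, plus one high-leverage point far from the cloud:

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs, ys = [1, 2, 3, 4], [1, 2, 2, 1]
print(round(pearson_r(xs, ys), 2))                # -> 0.0  (no relationship)
print(round(pearson_r(xs + [20], ys + [20]), 2))  # -> 0.99 (one point dominates)
```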

How do I validate my regression model?

INTERNAL validation: (1) Check R^2 and adjusted R^2, (2) Test significance (F-test, t-tests), (3) Plot residuals (patterns, outliers), (4) Check assumptions (normality, homoscedasticity), (5) Calculate RMSE, MAE. EXTERNAL validation: (1) Train/test split (70/30), (2) Cross-validation (k-fold), (3) Test on new data. Example: Build model on 2020 data (R^2=0.8), test on 2021 (R^2=0.6 → some overfitting). Report both training and test performance. Common issue: Model fits training data perfectly but predicts poorly (overfitting) → use simpler model or regularization.
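A hold-out validation sketch: fit on a training set, then compare RMSE on training vs. test data. The split and the two hypothetical test points are ours; a test RMSE far above the training RMSE would signal overfitting:

```python
import math

def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def rmse(xs, ys, a, b):
    return math.sqrt(sum((y - (a + b * x)) ** 2
                         for x, y in zip(xs, ys)) / len(xs))

train_x, train_y = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
test_x, test_y = [6, 7], [7, 7]          # hypothetical held-out points
a, b = fit(train_x, train_y)
print(round(rmse(train_x, train_y, a, b), 3))  # -> 0.693 (training error)
print(round(rmse(test_x, test_y, a, b), 3))    # -> 0.4   (test error)
```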